Supported Formats
Sources: Wikipedia and IANA
This page details all the supported Datasaur formats, provides examples for each format and clarifies expected file structure where appropriate.

TSV

A TSV (tab-separated values) file is a simple text format for storing data in a tabular structure. A TSV file encodes a number of records that may contain multiple fields.
  • Each record is represented as a single line.
  • Each field value is represented as text.
  • Fields in a record are separated from one another by the tab character.
    • Note that because the tab is a special character for this format, fields that contain tabs are not allowed in this encoding.
  • The header (first) line of this encoding contains the name of each field, separated by tabs.

Example

Book Title	Author	Genre
Sherlock Holmes: A Study in Scarlet	Sir Arthur Conan Doyle	Fiction
To Kill a Mockingbird	Harper Lee	Fiction
Alan Turing: The Enigma	Andrew Hodges	Non fiction
Humble Pie	Gordon Ramsay	Non fiction
The Little Prince	Antoine de Saint-Exupéry	Fiction
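As a rough illustration of the rules above, the following Python sketch reads a TSV string with the standard `csv` module. The sample string and the helper name `read_tsv` are assumptions made for this illustration, not part of Datasaur; quoting is disabled because the format described here has no quoting mechanism.

```python
import csv
import io

# Hand-written sample that mirrors part of the example; "\t" is a real tab.
SAMPLE_TSV = (
    "Book Title\tAuthor\tGenre\n"
    "Humble Pie\tGordon Ramsay\tNon fiction\n"
    "The Little Prince\tAntoine de Saint-Exupéry\tFiction\n"
)


def read_tsv(text):
    """Return one dict per record, keyed by the names in the header line."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # fields are plain text; no quote handling
    )
    return list(reader)


records = read_tsv(SAMPLE_TSV)
for record in records:
    print(record["Book Title"], "|", record["Author"], "|", record["Genre"])
```

Because the header line supplies the field names, each record can be addressed by column name rather than by position.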

IOB (specialized .tsv)

IOB (inside, outside, beginning) is a common format for tagging tokens in computational linguistics (e.g., named-entity recognition). An IOB file is also a .tsv file, but it conforms to the following rules:
  • The B- prefix before a tag indicates that the tag is the beginning of a chunk.
  • The I- prefix before a tag indicates that the tag is inside a chunk.
  • The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them.
  • The O tag indicates that a token does not belong to a chunk.

Example

Sherlock	B-PER
Holmes	I-PER
become	O
widely	O
popular	O
in	O
1891	B-YEAR
.	O
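To make the tagging rules concrete, here is a minimal Python sketch that groups tagged tokens back into chunks. It assumes every token carries an explicit tag and that each chunk opens with a B- tag; the function name `iob_to_spans` and the sample rows are hypothetical and written only for this illustration.

```python
def iob_to_spans(rows):
    """Group (token, tag) pairs into (label, text) chunks.

    Assumption: B-X opens a chunk, I-X continues a chunk of the same
    type, and O marks a token that is outside any chunk.
    """
    spans = []
    label, tokens = None, []
    for token, tag in rows:
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        # Close the open chunk whenever the current tag does not extend it.
        if not continues and label is not None:
            spans.append((label, " ".join(tokens)))
            label, tokens = None, []
        if prefix in ("B", "I"):
            if not continues:
                label, tokens = name, []
            tokens.append(token)
    if label is not None:
        spans.append((label, " ".join(tokens)))
    return spans


# Hand-written sample rows with an explicit tag on every token.
ROWS = [
    ("Sherlock", "B-PER"), ("Holmes", "I-PER"), ("become", "O"),
    ("widely", "O"), ("popular", "O"), ("in", "O"),
    ("1891", "B-YEAR"), (".", "O"),
]

spans = iob_to_spans(ROWS)
print(spans)
```

A stray I- tag with no open chunk is treated leniently here and starts a new chunk; a stricter reader might reject it instead.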

CSV

A CSV (comma-separated values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

Example

Book Title,Author,Genre
Sherlock Holmes: A Study in Scarlet,Sir Arthur Conan Doyle,Fiction
To Kill a Mockingbird,Harper Lee,Fiction
Alan Turing: The Enigma,Andrew Hodges,Non fiction
Humble Pie,Gordon Ramsay,Non fiction
The Little Prince,Antoine de Saint-Exupéry,Fiction

XLSX

XLSX is a well-known format for Microsoft Excel documents that was introduced by Microsoft with the release of Microsoft Office 2007. An XLSX file is also typically used to store tabular data.

Example

💡 You may find a sample file here.

JSON

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).
A JSON file may contain the following data structures:
  • An object is an unordered set of name/value pairs.
    • An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon), and the name/value pairs are separated by , (comma).
  • An array is an ordered collection of values.
    • An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).
  • A value can be a string in double quotes, a number, true, false, null, an object, or an array. These structures can be nested.
  • A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single-character string. A string is very much like a C or Java string.
  • A number is like a C or Java number, except that the octal and hexadecimal formats are not used.
  • Whitespace can be inserted between any pair of tokens. Excepting a few encoding details, that completely describes the language.
The example below uses the following fields, which Datasaur recognizes:
  • text: the sentence.
  • entity.text: the labeled token (the text field inside each item of entities).
  • type: the label applied.
  • start_idx: the character position at which the labeled token starts.
    • The character position uses a zero-based index.
  • end_idx: the position of the token's last character + 1 (end_idx is exclusive, so it does not include the last character).
    • The character position uses a zero-based index.
💡 The "entities": [] value is required for now. It distinguishes our token-based JSON format from the free-form JSON format. We will make it optional in the future.

Example

[
  {
    "text": "The new series Narcos created by Chris Brancato , Eric Newman and Carlo Bernard , represents a pretty ambitious step for Netflix .",
    "entities": [
      {
        "text": "Narcos",
        "type": "TITLE",
        "start_idx": 15,
        "end_idx": 21
      },
      {
        "text": "Chris Brancato",
        "type": "PER",
        "start_idx": 33,
        "end_idx": 47
      },
      {
        "text": "Carlo Bernard",
        "type": "PER",
        "start_idx": 66,
        "end_idx": 79
      },
      {
        "text": "Eric Newman",
        "type": "PER",
        "start_idx": 50,
        "end_idx": 61
      },
      {
        "text": "Netflix",
        "type": "ORG",
        "start_idx": 121,
        "end_idx": 128
      }
    ]
  }
]
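Since start_idx is zero-based and end_idx is exclusive, the pair maps directly onto a string slice. The sketch below is one way to sanity-check a file before import; the helper name `find_offset_errors` and the trimmed sample are assumptions for this illustration and not a Datasaur API.

```python
import json

# Trimmed copy of the example: one sentence and three of its entities.
SAMPLE_JSON = """
[
  {
    "text": "The new series Narcos created by Chris Brancato , Eric Newman and Carlo Bernard , represents a pretty ambitious step for Netflix .",
    "entities": [
      {"text": "Narcos", "type": "TITLE", "start_idx": 15, "end_idx": 21},
      {"text": "Chris Brancato", "type": "PER", "start_idx": 33, "end_idx": 47},
      {"text": "Netflix", "type": "ORG", "start_idx": 121, "end_idx": 128}
    ]
  }
]
"""


def find_offset_errors(documents):
    """Return (expected, actual) pairs where the slice and the entity text differ."""
    errors = []
    for document in documents:
        for entity in document["entities"]:
            # end_idx is exclusive, so a plain slice recovers the token.
            actual = document["text"][entity["start_idx"]:entity["end_idx"]]
            if actual != entity["text"]:
                errors.append((entity["text"], actual))
    return errors


documents = json.loads(SAMPLE_JSON)
errors = find_offset_errors(documents)
print("offset errors:", errors)
```

An empty result means every entity's offsets select exactly its text.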

JSON_TABULAR

JSON_TABULAR is a derivative of the JSON format that represents tabular data as an array of objects, with one object per row. Choose this format if you are working on row-based labeling.

Example

[
  {
    "Book Title": "Sherlock Holmes: A Study in Scarlet",
    "Author": "Sir Arthur Conan Doyle",
    "Genre": "Fiction"
  },
  {
    "Book Title": "To Kill a Mockingbird",
    "Author": "Harper Lee",
    "Genre": "Fiction"
  },
  {
    "Book Title": "Alan Turing: The Enigma",
    "Author": "Andrew Hodges",
    "Genre": "Non fiction"
  },
  {
    "Book Title": "Humble Pie",
    "Author": "Gordon Ramsay",
    "Genre": "Non fiction"
  },
  {
    "Book Title": "The Little Prince",
    "Author": "Antoine de Saint-Exupéry",
    "Genre": "Fiction"
  }
]
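Because a JSON_TABULAR file is an array with one object per row, it can be produced from a CSV file in a few lines. The following Python sketch shows one possible conversion; the function name `csv_to_json_tabular` is hypothetical, and the sketch assumes the first CSV line holds the column names.

```python
import csv
import io
import json

# Hand-written sample with a header line and two rows.
SAMPLE_CSV = (
    "Book Title,Author,Genre\n"
    "Humble Pie,Gordon Ramsay,Non fiction\n"
    "The Little Prince,Antoine de Saint-Exupéry,Fiction\n"
)


def csv_to_json_tabular(csv_text):
    """Turn CSV text into a JSON array of objects keyed by the header names."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
    # ensure_ascii=False keeps characters such as "é" readable in the output.
    return json.dumps(rows, ensure_ascii=False, indent=2)


json_tabular = csv_to_json_tabular(SAMPLE_CSV)
print(json_tabular)
```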

TSV_NON_IOB

TSV_NON_IOB is a derivative of the TSV format for data that does not follow the IOB format; for example, B-GEO is just GEO. If your project is token-based (with or without arrows), you can choose this format for export.
A TSV_NON_IOB file contains the following data structure (this explanation is based on our examples below):
  1. #FORMAT: the file header.
  2. #Text: the sentence representation.
  3. 1-1: the sentence-token.
    1. The first 1 indicates the sentence number.
    2. The second 1 indicates the token number.
  4. 0-3: the character index.
  5. TITLE[1]: the label applied.
    1. [1] uniquely identifies the annotation across lines.
  6. Column 5: indicates layer 2.
  7. author[2_1]: the label on the arrow.
    1. 2 indicates the arrow’s token origin.
    2. 1 indicates the arrow’s token destination.
  8. Column 7: indicates layer 4.
  9. Column 8: indicates layer 5.
Note: columns 5, 7, and 8 are filled if you label the token in the mentioned layers.
💡 We built this format to be compatible with [WebAnno](https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv).

Example (token-based)

#FORMAT=Datasaur TSV 3

#Text=The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .
1-1	0-3	The	TITLE[1]	_	_	_	_
1-2	4-10	Little	TITLE[1]	_	_	_	_
1-3	11-17	Prince	TITLE[1]	_	_	_	_
1-4	18-20	is	_	_	_	_	_
1-5	21-22	a	_	_	_	_	_
1-6	23-30	novella	_	_	_	_	_
1-7	31-33	by	_	_	_	_	_
1-8	34-40	French	_	_	_	_	_
1-9	41-51	aristocrat	_	_	_	_	_
1-10	52-53	,	_	_	_	_	_
1-11	54-60	writer	_	_	_	_	_
1-12	61-62	,	_	_	_	_	_
1-13	63-66	and	_	_	_	_	_
1-14	67-74	aviator	_	_	_	_	_
1-15	75-82	Antoine	PER[2]	_	_	_	_
1-16	83-85	de	PER[2]	_	_	_	_
1-17	86-91	Saint	PER[2]	_	_	_	_
1-18	92-93	-	PER[2]	_	_	_	_
1-19	94-101	Exupéry	PER[2]	_	_	_	_
1-20	102-103	.	_	_	_	_	_

Example (token-based with arrows)

#FORMAT=Datasaur TSV 3

#Text=The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .
1-1	0-3	The	TITLE[1]	_	*	author[2_1]	_
1-2	4-10	Little	TITLE[1]	_	_	_	_
1-3	11-17	Prince	TITLE[1]	_	_	_	_
1-4	18-20	is	_	_	_	_	_
1-5	21-22	a	_	_	_	_	_
1-6	23-30	novella	_	_	_	_	_
1-7	31-33	by	_	_	_	_	_
1-8	34-40	French	_	_	_	_	_
1-9	41-51	aristocrat	_	_	_	_	_
1-10	52-53	,	_	_	_	_	_
1-11	54-60	writer	_	_	_	_	_
1-12	61-62	,	_	_	_	_	_
1-13	63-66	and	_	_	_	_	_
1-14	67-74	aviator	_	_	_	_	_
1-15	75-82	Antoine	PER[2]	_	_	_	_
1-16	83-85	de	PER[2]	_	_	_	_
1-17	86-91	Saint	PER[2]	_	_	_	_
1-18	92-93	-	PER[2]	_	_	_	_
1-19	94-101	Exupéry	PER[2]	_	_	_	_
1-20	102-103	.	_	_	_	_	_
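The column layout described for TSV_NON_IOB can be read with a short parser. The Python sketch below handles only the first four columns (sentence-token, character index, token, and label) and merges tokens that share the same bracketed annotation number into one span. The helper names are hypothetical, and the sketch assumes a single sentence whose character offsets index into the #Text line.

```python
import re

# Rows copied by hand from the token-based example; "\t" is the column separator.
SAMPLE_LINES = [
    "#FORMAT=Datasaur TSV 3",
    "",
    "#Text=The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .",
    "1-1\t0-3\tThe\tTITLE[1]\t_\t_\t_\t_",
    "1-2\t4-10\tLittle\tTITLE[1]\t_\t_\t_\t_",
    "1-3\t11-17\tPrince\tTITLE[1]\t_\t_\t_\t_",
    "1-4\t18-20\tis\t_\t_\t_\t_\t_",
]

# A label such as "TITLE[1]" is a name plus an optional annotation number.
LABEL_PATTERN = re.compile(r"^(?P<name>.+?)(?:\[(?P<ref>\d+)\])?$")


def parse_token_row(line):
    """Split one token line into its sentence-token, offsets, form and label."""
    columns = line.split("\t")
    sentence_no, token_no = (int(part) for part in columns[0].split("-"))
    start, end = (int(part) for part in columns[1].split("-"))
    label, ref = None, None
    if columns[3] != "_":  # "_" marks an empty column
        match = LABEL_PATTERN.match(columns[3])
        label = match.group("name")
        ref = int(match.group("ref")) if match.group("ref") else None
    return {"sentence": sentence_no, "token": token_no, "start": start,
            "end": end, "form": columns[2], "label": label, "ref": ref}


def read_spans(lines):
    """Merge tokens sharing a label and annotation number into (label, text)."""
    text, extents = "", {}
    for line in lines:
        if line.startswith("#Text="):
            text = line[len("#Text="):]
        elif line and not line.startswith("#"):
            row = parse_token_row(line)
            if row["label"] is None:
                continue
            key = (row["label"], row["ref"])
            start, end = extents.get(key, (row["start"], row["end"]))
            extents[key] = (min(start, row["start"]), max(end, row["end"]))
    return [(label, text[start:end]) for (label, _), (start, end) in extents.items()]


spans = read_spans(SAMPLE_LINES)
print(spans)
```

Labels without an annotation number all share the key `(label, None)` in this sketch, so a real reader would need an extra rule to keep separate unnumbered spans apart.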

CoNLL-U

Universal Dependencies use a revised version of the CoNLL-X format called CoNLL-U. Each sentence starts with comment lines (such as sent_id and text) followed by one or more word lines. The comment lines and the word-line fields are:
  1. sent_id: Sentence id.
  2. text: Sentence.
  3. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  4. FORM: Word form or punctuation symbol.
  5. LEMMA: Lemma or stem of word form.
  6. UPOS: Universal part-of-speech tag.
  7. XPOS: Language-specific part-of-speech tag; underscore if not available.
  8. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  9. HEAD: Head of the current word, which is either a value of ID or zero (0).
  10. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  11. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  12. MISC: Any other annotation.

Example

# sent_id = 1
# text = Sherlock Holmes become widely popular in 1891 .
1	Sherlock	_	_	NNP	_	3	nsubj	_	_
2	Holmes	_	_	NNP	_	_	_	_	_
3	become	_	_	VBD	_	0	root	_	_
4	widely	_	_	RB	_	5	advmod	_	_
5	popular	_	_	JJ	_	3	xcomp	_	_
6	in	_	_	IN	_	_	_	_	_
7	1891	_	_	CD	_	_	_	_	_
8	.	_	_	.	_	_	_	_	_
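A CoNLL-U file can be read line by line: comment lines start with #, word lines carry ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC, per the Universal Dependencies specification), and a blank line ends a sentence. The Python sketch below follows those rules; the function name `parse_conllu` and the trimmed sample are assumptions for illustration only.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Trimmed, hand-copied sample; "\t" is the column separator.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Sherlock Holmes become widely popular in 1891 .\n"
    "1\tSherlock\t_\t_\tNNP\t_\t3\tnsubj\t_\t_\n"
    "3\tbecome\t_\t_\tVBD\t_\t0\troot\t_\t_\n"
    "5\tpopular\t_\t_\tJJ\t_\t3\txcomp\t_\t_\n"
)


def parse_conllu(text):
    """Return a list of sentences, each with its comments and word lines."""
    sentences = []
    current = {"comments": {}, "words": []}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current["words"]:
                sentences.append(current)
            current = {"comments": {}, "words": []}
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            current["comments"][key.strip()] = value.strip()
        else:
            current["words"].append(dict(zip(FIELDS, line.split("\t"))))
    if current["words"]:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
root = next(word for word in sentences[0]["words"] if word["HEAD"] == "0")
print(sentences[0]["comments"]["text"])
print("root:", root["FORM"])
```

All values are kept as strings here, because ID may be a range or a decimal number and unset fields hold an underscore.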

CoNLL

CoNLL is also a .tsv format, and it targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank line.

Example

1	Sherlock	_	_	_	_	0	_	_	_
2	Holmes	_	_	_	_	0	_	_	_
3	become	_	_	_	_	0	_	_	_
4	widely	_	_	_	_	0	_	_	_
5	popular	_	_	_	_	0	_	_	_
6	in	_	_	_	_	0	_	_	_
7	1891	_	_	YEAR	_	0	_	_	_
8	.	_	_	_	_	0	_	_	_

CoNLL_2003

CoNLL_2003 is usually used for POS tagging and named entity recognition labeling. All data files contain one word per line, with empty lines representing sentence boundaries. At the end of each line there is a tag that states whether the current word is inside a named entity or not. The tag also encodes the type of named entity. Each line contains four fields:
  1. The word
  2. Part-of-speech tag
  3. Chunk tag
  4. Named entity tag
Note: Importing or exporting files in the conll_2003 format is possible if the following task settings are checked.
  • Tokens and token spans should have at most one label.
  • Allow arrows to be drawn between labels. Checking this setting activates the layer feature.
You can do POS tagging on Layer 0 and NER tagging on Layer 1. If you export the file as conll_2003, the result will look like the example below.

Example

Sherlock NNP B-Person O
Holmes NNP I-Person O
become VB O O
widely RB O O
popular JJ O O
in IN O O
1891 CD B-Year O
. . O O

JSON_ADVANCED

JSON_ADVANCED is a proprietary Datasaur format designed in collaboration with our users to capture all possible data. This format is commonly used for partial token labeling projects. You can also use it when exporting token-based projects with arrows, such as coreference and dependency projects.
A JSON_ADVANCED file may contain the following data structures:
  1. Sentences field
    1. id: the sentence position.
    2. content: the text of the sentence.
    3. tokens: the tokenized form of the sentence.
    4. labels
      1. l: the label applied.
      2. layer: the layer position of the label. This field is reserved for projects that label with multiple tag sets at once. For now, you can disregard this field; it is always set to 0.
      3. id: the unique identifier of a label.
        1. If the id has 9 segments, it identifies a span label. For example, INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0, which follows the pattern <label set item id>:<layer>:<sidS>:<s>:<charS>:<sidE>:<e>:<charE>:<index>.
        2. If the id has 21 segments, it identifies an arrow label. For example, tfc1FkbbEk9fOLx6haR1s:0:INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0:Oq_VuB0s_N7D8ZY0rgYsg:0:0:0:0:0:2:5:0:0, which follows the pattern <label set item id>:<arrow layer>:<….. origin id>:<….destination id>:<arrow index>.
      4. hashCode: Datasaur's code to represent label information.
        1. Span label. For example, SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined. Below is the explanation:
          1. type:label set item id:layer or label set index:start cell line:cell index:start token index:start char index:end cell line:end cell index:end token index:end char index:counter.
        2. Arrow label. For example, ARROW:dyC-o1HBnn49dcqDSphmJ:1:0:0:0:0:0:0:10:6:0:SPAN:geo:0:0:0:0:0:0:0:0:4:0:undefined:undefined:SPAN:geo:0:0:0:10:0:0:0:10:6:0:undefined:undefined. Below is the explanation:
          1. type:label set item id:layer or label set index:start cell line:cell index:start token index:start char index:end cell line:end cell index:end token index:end char index:counter:<span label: origin>:<span label: destination>.
      5. documentId: the id of the document.
      6. sidS, sidE: the starting and ending sentence positions of a label, as 0-based indexes. In Datasaur, a label can span across sentences.
      7. s: the starting token position of a label in the starting sentence, as a 0-based index.
      8. e: the ending token position of a label in the ending sentence, as a 0-based index.
      9. charS: the starting character position of a label in the starting token, as a 0-based index.
      10. charE: the ending character position of a label in the ending token, as a 0-based index.
    5. metadata: additional information for a cell.
  2. labelerInfo: the information about the labeler.
    1. id: the unique identifier of a labeler (each labeler has a different id).
    2. email: the email that the labeler used when signing in.
    3. displayName: the display name of the email.
  3. labelSet: contains all the label items that you used for the project.
    1. index: the position of the label set in the UI.
    2. labelItems: an array of labelItems for a label set.
      • id: id of the labelSetItem
      • labelName: the displayed name of the label set item
      • parentId: id of the parent label set item
      • color: the color of the label set item
  4. labels: an array of labels for the document.
    1. labelText: label content for a row-based project. It is null for any project other than a row-based project.
    2. id: identifier of the applied label.
    3. documentId: identifier of the document where the label is applied.
    4. startCellLine: starting line sentence position.
    5. startCellIndex: starting line column position.
    6. startTokenIndex: starting token index position.
    7. startCharIndex: starting character index position (relative to tokenIndex; it starts from 0 again when tokenIndex is incremented).
    8. endCellLine: ending line sentence position.
    9. endCellIndex: ending line column position.
    10. endTokenIndex: ending token index position.
    11. endCharIndex: ending character index position.
    12. layer: the layer where the token is positioned.
    13. counter: allows labels with the same name to be placed multiple times in the same position; starts from 0.
    14. type: the type of the label: SPAN, ARROW, or BOUNDING_BOX.
    15. createdAt:
      1. Labeler: the time the label was applied.
      2. Reviewer: the time the label was accepted.
    16. updatedAt: the timestamp of the last update to the label.
    17. Review-related fields
      1. acceptedByUserId: the user id of the reviewer who accepted the label. It is null if no user accepted it manually.
      2. rejectedByUserId: the user id of the reviewer who rejected the label. It is null if no user rejected it manually.
      3. labeledByUserId: the user id of a reviewer.
      4. labeledBy:
        • CONFLICT if it has not been resolved
        • REVIEWER if it has been resolved
        • AUTO if it has been resolved by meeting the consensus
    18. Arrow label type specific fields
      1. originId: origin id of an arrow label.
      2. originNumber: auto-increment ID for the origin.
      3. destinationId: destination id of an arrow label.
      4. destinationNumber: auto-increment ID for the destination.
    19. Bounding box label type specific fields
      1. pageIndex: index of the page if the document contains multiple pages.
      2. nodeCount: total number of bounding box points.
      3. x0: x coordinate of the top left position of the bounding box.
      4. y0: y coordinate of the top left position of the bounding box.
      5. x1: x coordinate of the top right position of the bounding box.
      6. y1: y coordinate of the top right position of the bounding box.
      7. x2: x coordinate of the bottom right position of the bounding box.
      8. y2: y coordinate of the bottom right position of the bounding box.
      9. x3: x coordinate of the bottom left position of the bounding box.
      10. y3: y coordinate of the bottom left position of the bounding box.
    20. pages: an array of page information for the OCR project type.
      1. pageIndex: index of the page if the document contains multiple pages.
      2. pageHeight: original page height in pixels.
      3. pageWidth: original page width in pixels.
  5. comments
    1. id: the id of the comment.
    2. parentId: the id of the parent comment. This is filled if the comment thread has replies.
    3. hashCode: Datasaur's code to represent the comment's information, including the value being commented on.
    4. message: the content of the comment.
    5. type: the type of comment; can be SPAN_LABEL, SPAN_TEXT, ARROW_LABEL, or CELL_LABEL.
    6. userId: the id of the user who created the comment.
    7. createdAt: the time when the user created the comment.
Example (token-based with arrow)

{
  "sentences": [
    {
      "id": 0,
      "content": "The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .",
      "tokens": ["The","Little","Prince","is","a","novella","by","French","aristocrat",",","writer",",","and","aviator","Antoine","de","Saint","-","Exupéry","."],
      "labels": [
        {
          "layer": 0,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 2,
          "charE": 5,
          "l": "vGOy0ZKA-2rqK7netKz9I",
          "id": "vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 752,
          "hashCode": "vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:SPAN:undefined:undefined",
          "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
          "comments": []
        },
        {
          "layer": 1,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 18,
          "charE": 6,
          "l": "A-92mMT_WppaawOOBJbjt",
          "id": "A-92mMT_WppaawOOBJbjt:1:9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 752,
          "hashCode": "A-92mMT_WppaawOOBJbjt:1:0:0:0:0:0:0:18:6:0:ARROW:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:SPAN:undefined:undefined:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:SPAN:undefined:undefined",
          "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
          "comments": []
        },
        {
          "layer": 0,
          "sidS": 0,
          "s": 14,
          "charS": 0,
          "sidE": 0,
          "e": 18,
          "charE": 6,
          "l": "9I-5oYKvnzJRWHgsZrDe_",
          "id": "9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 752,
          "hashCode": "9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:SPAN:undefined:undefined",
          "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
          "comments": []
        }
      ]
    }
  ],
  "labelSets": [
    {
      "labelItems": [
        {
          "id": "vGOy0ZKA-2rqK7netKz9I",
          "labelName": "Novel",
          "parentId": null,
          "color": null
        },
        {
          "id": "9I-5oYKvnzJRWHgsZrDe_",
          "labelName": "Male",
          "parentId": null,
          "color": null
        }
      ]
    },
    {
      "labelItems": [
        {
          "id": "A-92mMT_WppaawOOBJbjt",
          "labelName": "Author",
          "parentId": null,
          "color": null
        }
      ]
    }
  ],
  "labels": [
    {
      "labelText": null,
      "id": "508985812",
      "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
      "labeledByUserId": 752,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 2,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": null,
      "originNumber": "0",
      "destinationId": null,
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "vGOy0ZKA-2rqK7netKz9I",
      "status": "ACCEPTED",
      "createdAt": "2021-09-03T08:13:38.262Z",
      "updatedAt": "2021-09-03T08:13:38.330Z"
    },
    {
      "labelText": null,
      "id": "508985908",
      "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
      "labeledByUserId": 752,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 18,
      "endCharIndex": 6,
      "layer": 1,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": "508985830",
      "originNumber": "508985830",
      "destinationId": "508985812",
      "destinationNumber": "508985812",
      "type": "ARROW",
      "labelSetItemId": "A-92mMT_WppaawOOBJbjt",
      "status": "ACCEPTED",
      "createdAt": "2021-09-03T08:14:05.307Z",
      "updatedAt": "2021-09-03T08:14:05.397Z",
      "origin": {
        "labelText": null,
        "id": "508985830",
        "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
        "labeledByUserId": 752,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 14,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 18,
        "endCharIndex": 6,
        "layer": 0,
        "counter": 0,
        "labeledBy": "LABELER",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originId": null,
        "originNumber": "0",
        "destinationId": null,
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "9I-5oYKvnzJRWHgsZrDe_",
        "status": "ACCEPTED",
        "createdAt": "2021-09-03T08:13:40.721Z",
        "updatedAt": "2021-09-03T08:13:40.762Z"
      },
      "destination": {
        "labelText": null,
        "id": "508985812",
        "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
        "labeledByUserId": 752,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 0,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 2,
        "endCharIndex": 5,
        "layer": 0,
        "counter": 0,
        "labeledBy": "LABELER",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originId": null,
        "originNumber": "0",
        "destinationId": null,
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "vGOy0ZKA-2rqK7netKz9I",
        "status": "ACCEPTED",
        "createdAt": "2021-09-03T08:13:38.262Z",
        "updatedAt": "2021-09-03T08:13:38.330Z"
      }
    },
    {
      "labelText": null,
      "id": "508985830",
      "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
      "labeledByUserId": 752,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 14,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 18,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": null,
      "originNumber": "0",
      "destinationId": null,
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "9I-5oYKvnzJRWHgsZrDe_",
      "status": "ACCEPTED",
      "createdAt": "2021-09-03T08:13:40.721Z",
      "updatedAt": "2021-09-03T08:13:40.762Z"
    }
  ]
}
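The 9-segment span id described earlier packs the same positions as the s, e, charS and charE fields, so it can be unpacked with a plain split. The Python sketch below does that and resolves each span to its label name and tokens. The helper names and the trimmed `EXPORT` dict are assumptions for illustration; the sketch also assumes that each span stays within one sentence and covers whole tokens, and that e is inclusive, as the example data suggests.

```python
SPAN_ID_FIELDS = ["labelSetItemId", "layer", "sidS", "s", "charS",
                  "sidE", "e", "charE", "index"]

# Trimmed copy of the example: only the fields this sketch reads.
EXPORT = {
    "sentences": [{
        "tokens": ["The", "Little", "Prince", "is", "a", "novella", "by",
                   "French", "aristocrat", ",", "writer", ",", "and", "aviator",
                   "Antoine", "de", "Saint", "-", "Exupéry", "."],
        "labels": [
            {"id": "vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0"},
            {"id": "A-92mMT_WppaawOOBJbjt:1:9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0:0"},
            {"id": "9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0"},
        ],
    }],
    "labelSets": [
        {"labelItems": [{"id": "vGOy0ZKA-2rqK7netKz9I", "labelName": "Novel"},
                        {"id": "9I-5oYKvnzJRWHgsZrDe_", "labelName": "Male"}]},
        {"labelItems": [{"id": "A-92mMT_WppaawOOBJbjt", "labelName": "Author"}]},
    ],
}


def parse_span_id(label_id):
    """Split a 9-segment span id into named fields (numbers become ints)."""
    parts = label_id.split(":")
    if len(parts) != len(SPAN_ID_FIELDS):
        raise ValueError(f"expected 9 segments, got {len(parts)}")
    return {key: (value if key == "labelSetItemId" else int(value))
            for key, value in zip(SPAN_ID_FIELDS, parts)}


def describe_spans(export):
    """Return (label name, labeled tokens) for every span label."""
    names = {item["id"]: item["labelName"]
             for label_set in export["labelSets"]
             for item in label_set["labelItems"]}
    described = []
    for sentence in export["sentences"]:
        for label in sentence["labels"]:
            if label["id"].count(":") != 8:
                continue  # not a 9-segment span id (e.g. a 21-segment arrow id)
            span = parse_span_id(label["id"])
            tokens = sentence["tokens"][span["s"]:span["e"] + 1]  # e is inclusive
            described.append((names[span["labelSetItemId"]], " ".join(tokens)))
    return described


described = describe_spans(EXPORT)
print(described)
```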

Example (token-based with character-based labeling)

{
  "sentences": [
    {
      "id": 0,
      "content": "The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .",
      "tokens": ["The","Little","Prince","is","a","novella","by","French","aristocrat",",","writer",",","and","aviator","Antoine","de","Saint","-","Exupéry","."],
      "labels": [
        {
          "layer": 2,
          "sidS": 0,
          "s": 5,
          "charS": 0,
          "sidE": 0,
          "e": 5,
          "charE": 4,
          "l": "dKXDeLxSHz1wZdXvA5yQz",
          "id": "dKXDeLxSHz1wZdXvA5yQz:2:0:5:0:0:5:4:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 752,
          "hashCode": "dKXDeLxSHz1wZdXvA5yQz:2:0:0:5:0:0:0:5:4:0:SPAN:undefined:undefined",
          "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
          "comments": []
        }
      ]
    }
  ],
  "labelSets": [
    {
      "labelItems": [
        {
          "id": "O1T-l9CGbonHyxj0GOtAo",
          "labelName": "Noun phrase",
          "parentId": null,
          "color": "#ff8000"
        },
        {
          "id": "dKXDeLxSHz1wZdXvA5yQz",
          "labelName": "NN",
          "parentId": "O1T-l9CGbonHyxj0GOtAo",
          "color": "#ff8000"
        },
        {
          "id": "pWUd_Sa1bAiFe38MzU8OL",
          "labelName": "NNP",
          "parentId": "O1T-l9CGbonHyxj0GOtAo",
          "color": "#ff8000"
        },
        {
          "id": "6zUYagMuBuYmal8zITzqZ",
          "labelName": "Verb phrase",
          "parentId": null,
          "color": "#df3920"
        },
        {
          "id": "-8zD8jA9XRKQJBNL5snmp",
          "labelName": "VBT",
          "parentId": "6zUYagMuBuYmal8zITzqZ",
          "color": "#df3920"
        },
        {
          "id": "dZ9UplPt07D97EmrV5Dpn",
          "labelName": "VBD",
          "parentId": "6zUYagMuBuYmal8zITzqZ",
          "color": "#df3920"
        },
        {
          "id": "m59PdwZSqx50OY4K58vCw",
          "labelName": "VBN",
          "parentId": "6zUYagMuBuYmal8zITzqZ",
          "color": "#df3920"
        },
        {
          "id": "BCn1q1clVI9oAje2boyLX",
          "labelName": "VBI",
          "parentId": "6zUYagMuBuYmal8zITzqZ",
          "color": "#df3920"
        },
        {
          "id": "BwjMomxD2E_UJoCNuy_IL",
          "labelName": "VB",
          "parentId": "6zUYagMuBuYmal8zITzqZ",
          "color": "#df3920"
        }
      ]
    }
  ],
  "labels": [
    {
      "labelText": null,
      "id": "509007771",
      "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
      "labeledByUserId": 752,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 5,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 5,
      "endCharIndex": 4,
      "layer": 2,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": null,
      "originNumber": "0",
      "destinationId": null,
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "dKXDeLxSHz1wZdXvA5yQz",
      "status": "ACCEPTED",
      "createdAt": "2021-09-03T09:22:58.456Z",
      "updatedAt": "2021-09-03T09:22:58.512Z"
    }
  ]
}

Example (token-based with bounding-box labeling)

{
  "sentences": [
    {
      "id": 0,
      "content": "73",
      "tokens": ["73"],
      "labels": [
        {
          "layer": 0,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 0,
          "charE": 1,
          "l": "fDsCQJFyWy5LnMPtHK4DC",
          "id": "fDsCQJFyWy5LnMPtHK4DC:0:0:0:0:0:0:1:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 1,
          "hashCode": "fDsCQJFyWy5LnMPtHK4DC:0:0:0:0:0:0:0:0:1:0:SPAN:undefined:undefined",
          "documentId": "dac38af2-cfb3-4007-b2ca-302dc8c450fe",
          "comments": []
        }
      ]
    }
  ],
  "labelSets": [
    {
      "labelItems": [
        { "id": "fDsCQJFyWy5LnMPtHK4DC", "labelName": "Queue number" }
      ]
    }
  ],
  "labels": [
    {
      "labeledBy": "PRELABELED",
      "labeledByUserId": null,
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "type": "BOUNDING_BOX",
      "status": "LABELED",
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 0,
      "endCharIndex": 1,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "x0": 228,
      "y0": 114,
      "x1": 286,
      "y1": 114,
      "x2": 286,
      "y2": 158,
      "x3": 228,
      "y3": 158,
      "nodeCount": 4
    },
    {
      "labelText": null,
      "labeledByUserId": 1,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 0,
      "endCharIndex": 1,
      "layer": 0,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": null,
      "originNumber": "0",
      "destinationId": null,
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "fDsCQJFyWy5LnMPtHK4DC",
      "status": "ACCEPTED"
    }
  ],
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ]
}
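The four corner points of a BOUNDING_BOX label can be reduced to a position and size, and related to the page dimensions from pages. The Python sketch below shows one way to do that using the values from this example; the function name `bounding_box_extent` is hypothetical, and the sketch assumes a four-point box (nodeCount of 4).

```python
# Values copied from the example above.
BOX_LABEL = {"type": "BOUNDING_BOX", "pageIndex": 0, "nodeCount": 4,
             "x0": 228, "y0": 114, "x1": 286, "y1": 114,
             "x2": 286, "y2": 158, "x3": 228, "y3": 158}
PAGE = {"pageIndex": 0, "pageHeight": 619, "pageWidth": 551}


def bounding_box_extent(label, page):
    """Return the box position and size in pixels and relative to the page."""
    xs = [label[f"x{i}"] for i in range(4)]
    ys = [label[f"y{i}"] for i in range(4)]
    left, top = min(xs), min(ys)
    width, height = max(xs) - left, max(ys) - top
    return {
        "left": left,
        "top": top,
        "width": width,
        "height": height,
        # Fractions of the original page size, handy when an image is rescaled.
        "relative_width": width / page["pageWidth"],
        "relative_height": height / page["pageHeight"],
    }


extent = bounding_box_extent(BOX_LABEL, PAGE)
print(extent)
```

Taking the minimum and maximum of the corner coordinates keeps the sketch valid even if the corners arrive in a different order.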

Datasaur Schema Format

Datasaur Schema is a customized JSON format that is designed to fit all available project types in the Datasaur app. This format can be used for mixed project types, e.g. Token + Document labeling. You will receive all labels and answers combined in one exported file.
A Datasaur Schema file contains the following data structures:
  1. version: the version number of the Datasaur schema.
  2. Sentence Field
    1. content: the text of the sentence.
    2. tokens: the tokenized form of the sentence.
    3. metadata: contains additional information for a line.
  3. labelerInfo: the information about the labeler.
    1. id: the unique identifier of a labeler (each labeler has a different id).
    2. email: the email that the labeler used when signing in.
    3. displayName: the display name of the email.
  4. labelSet: contains all the label items that you used for the project.
    1. index: the position of the label set in the UI.
    2. labelItems: an array of labelItems for a label set.
      1. id: id of the label set item.
      2. labelName: the displayed name of the label set item.
      3. parentId: id of the parent label set item.
      4. color: the color of the label set item.
  5. labels: an array of labels for the document. Labels consist of spanLabels, arrowLabels, boundingBoxLabels, and timeLabels.
    1. spanLabels are all labels that are applied directly to the token/sentence.
    2. arrowLabels are all labels that are applied on top of an arrow.
    3. boundingBoxLabels are all labels that are applied on top of OCR documents.
    4. timeLabels are all labels that are applied on top of an audio waveform.