Importable Format
Importable format is a JSON format used to import data into a Datasaur project. An Importable JSON file may contain the following data structures:
- 1.type: the Importable type (identified with the value "BOUNDING_BOX")
- 2.cells: an array of cells, where each cell is the intersection of a row and a column
  - 1.content: the sentence in the cell
  - 2.index: the column index of the cell
  - 3.line: the row index of the cell
  - 4.metadata: additional information for the cell
  - 5.tokens: an array of strings that defines custom tokenization
- 3.labelSets: the label sets used by the project
- 4.labels: an array of labels
  - 1.Span label type
    - 1.id: a unique number
    - 2.type: identified with the value "SPAN"
    - 3.startCellLine, startCellIndex, startTokenIndex, startCharIndex: the starting position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
    - 4.endCellLine, endCellIndex, endTokenIndex, endCharIndex: the ending position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
    - 5.layer: the index of the label set to which this label belongs
    - 6.counter: the index (0-based) of the label among the labels applied to the current position. If there is only one label at this position, its value is 0.
    - 7.labelSetItemId: the ID of this label within the label set
  - 2.Arrow label type
    - 1.id: a unique number
    - 2.type: identified with the value "ARROW"
    - 3.originId: the id of the label from which this arrow starts
    - 4.destinationId: the id of the label at which this arrow ends
    - 5.startCellLine, startCellIndex, startTokenIndex, startCharIndex: the same as the origin label's starting position
    - 6.endCellLine, endCellIndex, endTokenIndex, endCharIndex: the same as the destination label's ending position
    - 7.layer: the index of the label set to which this label belongs
    - 8.counter: the index (0-based) of the label among the labels applied to the current position. If there is only one label at this position, its value is 0.
    - 9.labelSetItemId: the ID of this label within the label set. Use an empty string if this arrow does not have a label.
  - 3.Bounding box label type
    - 1.type: identified with the value "BOUNDING_BOX"
    - 2.startCellLine: row index of the cell where the label starts
    - 3.startCellIndex: column index of the cell where the label starts
    - 4.startTokenIndex: index of the token where the label starts
    - 5.startCharIndex: index of the character where the label starts, relative to the token (it restarts from 0 whenever the token index increments)
    - 6.endCellLine: row index of the cell where the label ends
    - 7.endCellIndex: column index of the cell where the label ends
    - 8.endTokenIndex: index of the token where the label ends
    - 9.endCharIndex: index of the character where the label ends, relative to the token
    - 10.layer: the layer where the token is positioned
    - 11.counter: the index (0-based) of the label among the labels applied to the current position. If there is only one label at this position, its value is 0.
    - 12.pageIndex: index of the page, if the document contains multiple pages
    - 13.nodeCount: total number of points in the bounding box
    - 14.x0: x coordinate of the top left corner of the bounding box
    - 15.y0: y coordinate of the top left corner of the bounding box
    - 16.x1: x coordinate of the top right corner of the bounding box
    - 17.y1: y coordinate of the top right corner of the bounding box
    - 18.x2: x coordinate of the bottom right corner of the bounding box
    - 19.y2: y coordinate of the bottom right corner of the bounding box
    - 20.x3: x coordinate of the bottom left corner of the bounding box
    - 21.y3: y coordinate of the bottom left corner of the bounding box
- 5.pages: an array of page information
  - 1.pageIndex: index of the page, if the document contains multiple pages
  - 2.pageHeight: original page height in pixels
  - 3.pageWidth: original page width in pixels
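To make the relationships between these fields more concrete, here is a minimal sketch that checks a few cross-references in an Importable document loaded as a Python dict. The helper name `check_importable` and the checks themselves are assumptions inferred from the field descriptions above; they are not part of Datasaur's tooling, and Datasaur may apply different or additional rules on import.

```python
def check_importable(doc):
    """Return a list of problems found in an Importable dict.

    Hypothetical helper: the checks are inferred from the field
    descriptions, not taken from Datasaur's own validation.
    """
    problems = []
    labels = doc.get("labels", [])

    # Span and arrow labels carry an id; arrows refer to these ids.
    ids = [label["id"] for label in labels if "id" in label]
    if len(ids) != len(set(ids)):
        problems.append("label ids are not unique")

    # Map each label set index to the ids of its label items.
    items_by_layer = {
        label_set["index"]: {item["id"] for item in label_set.get("labelItems", [])}
        for label_set in doc.get("labelSets", [])
    }

    for label in labels:
        kind = label.get("type")
        if kind == "ARROW":
            for key in ("originId", "destinationId"):
                if label.get(key) not in ids:
                    problems.append(f"arrow {label.get('id')}: {key} not found")
        if kind in ("SPAN", "ARROW"):
            item_id = label.get("labelSetItemId", "")
            # An empty string is allowed for an arrow without a label.
            if kind == "ARROW" and item_id == "":
                continue
            if item_id not in items_by_layer.get(label.get("layer"), set()):
                problems.append(
                    f"label {label.get('id')}: unknown labelSetItemId {item_id!r}"
                )
    return problems
```

Calling `check_importable(json.load(f))` on a file before importing it would return an empty list when none of these particular inconsistencies are present.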
{
  "cells": [
    {
      "content": "The quick brown fox jumps over the lazy dog",
      "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
      "line": 0,
      "index": 0,
      "metadata": []
    }
  ],
  "labelSets": [
    {
      "name": "Subjects",
      "index": 0,
      "labelItems": [
        { "id": "FOX_ID", "labelName": "Fox" },
        { "id": "DOG_ID", "labelName": "Dog" }
      ]
    },
    {
      "name": "Verbs",
      "index": 1,
      "labelItems": [
        { "id": "JUMP_ID", "labelName": "Jump" }
      ]
    }
  ],
  "labels": [
    {
      "id": 1,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 3,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "FOX_ID"
    },
    {
      "id": 2,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 6,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "DOG_ID"
    },
    {
      "id": 3,
      "originId": 1,
      "destinationId": 2,
      "type": "ARROW",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 1,
      "counter": 0,
      "labelSetItemId": "JUMP_ID"
    }
  ],
  "name": "Example"
}
The above JSON will produce the following output:

The above project contains 1 cell and 3 labels (2 span labels and 1 arrow label) from 2 different label sets.
Refer to this for a more thorough explanation.
There is only one cell in the project, and for that cell we set:
- content as "The quick brown fox jumps over the lazy dog".
- tokens as ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"], because we use simple whitespace tokenization. You can define your own tokenization method here.
- line as 0, since the cell is on the first row.
- index as 0, since the cell is in the leftmost column.
- metadata as [], since we don't display any metadata.
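The cell described above can be produced programmatically. The following is a minimal sketch, assuming whitespace tokenization via Python's `str.split()`; the helper name `build_cell` is hypothetical and only illustrates how the five cell fields fit together.

```python
def build_cell(content, line, index):
    """Build one entry of the "cells" array (hypothetical helper)."""
    return {
        "content": content,
        "tokens": content.split(),  # split on runs of whitespace
        "line": line,               # row index of the cell
        "index": index,             # column index of the cell
        "metadata": [],             # no metadata is displayed
    }


cell = build_cell("The quick brown fox jumps over the lazy dog", line=0, index=0)
```

To use a different tokenization, replace `content.split()` with any function that returns a list of strings.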
There are 2 label sets, named Subjects (index 0) and Verbs (index 1). Please refer to this for a thorough explanation of Label Set Index. Label Items within each Label Set must have a unique id; for example, in the Subjects Label Set we have FOX_ID and DOG_ID.
The first label (Fox):
- starts from the first character of the second token in the first cell, hence we set:
  - startCellLine as 0, since the first cell's line is 0.
  - startCellIndex as 0, since the first cell's index is 0.
  - startTokenIndex as 1, since the label starts from the second token.
  - startCharIndex as 0, since the label starts from the first character of the token.
- ends at the third character of the fourth token in the first cell, hence we set:
  - endCellLine as 0
  - endCellIndex as 0
  - endTokenIndex as 3
  - endCharIndex as 2
- comes from the first Label Set, hence we set:
  - layer as 0
  - labelSetItemId as FOX_ID
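The position arithmetic for the Fox label can be sketched in code. This example assumes a span that covers whole tokens within a single cell, and it assumes endCharIndex is the 0-based index of the last character of the last token, which is how the example values read. The helper name `make_span_label` is hypothetical.

```python
def make_span_label(label_id, cell, first_token, last_token,
                    layer, label_set_item_id, counter=0):
    """Build a SPAN label covering whole tokens of one cell (hypothetical helper)."""
    tokens = cell["tokens"]
    return {
        "id": label_id,
        "type": "SPAN",
        "startCellLine": cell["line"],
        "startCellIndex": cell["index"],
        "startTokenIndex": first_token,
        "startCharIndex": 0,  # first character of the first token
        "endCellLine": cell["line"],
        "endCellIndex": cell["index"],
        "endTokenIndex": last_token,
        # Assumption: inclusive index of the last character of the last token.
        "endCharIndex": len(tokens[last_token]) - 1,
        "layer": layer,
        "counter": counter,
        "labelSetItemId": label_set_item_id,
    }


cell = {
    "content": "The quick brown fox jumps over the lazy dog",
    "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    "line": 0,
    "index": 0,
    "metadata": [],
}
fox = make_span_label(1, cell, first_token=1, last_token=3,
                      layer=0, label_set_item_id="FOX_ID")
```

Here the last token is "fox", which has three characters, so endCharIndex comes out as 2, matching the value in the example JSON.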
The second label (Dog) follows the same idea as the first label (Fox).
The third label (Jump):
- originates from the first label (Fox) and ends at the second label (Dog), hence we set:
  - originId as 1, because the first label has id 1.
  - destinationId as 2, because the second label has id 2.
  - startCellLine, startCellIndex, startTokenIndex, startCharIndex the same as the first label's (Fox).
  - endCellLine, endCellIndex, endTokenIndex, endCharIndex the same as the second label's (Dog).
- comes from the second Label Set, hence we set:
  - layer as 1
  - labelSetItemId as JUMP_ID
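Since an arrow label copies its start position from the origin label and its end position from the destination label, it can be derived from the two span labels. The sketch below is illustrative only, and the helper name `make_arrow_label` is hypothetical.

```python
START_KEYS = ("startCellLine", "startCellIndex", "startTokenIndex", "startCharIndex")
END_KEYS = ("endCellLine", "endCellIndex", "endTokenIndex", "endCharIndex")


def make_arrow_label(label_id, origin, destination, layer,
                     label_set_item_id="", counter=0):
    """Build an ARROW label from two existing labels (hypothetical helper)."""
    arrow = {
        "id": label_id,
        "type": "ARROW",
        "originId": origin["id"],
        "destinationId": destination["id"],
    }
    for key in START_KEYS:
        arrow[key] = origin[key]       # start position comes from the origin
    for key in END_KEYS:
        arrow[key] = destination[key]  # end position comes from the destination
    arrow["layer"] = layer
    arrow["counter"] = counter
    # An empty string means the arrow has no label.
    arrow["labelSetItemId"] = label_set_item_id
    return arrow


fox = {"id": 1, "type": "SPAN",
       "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2,
       "layer": 0, "counter": 0, "labelSetItemId": "FOX_ID"}
dog = {"id": 2, "type": "SPAN",
       "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 6, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 8, "endCharIndex": 2,
       "layer": 0, "counter": 0, "labelSetItemId": "DOG_ID"}
jump = make_arrow_label(3, fox, dog, layer=1, label_set_item_id="JUMP_ID")
```

The resulting `jump` dict has the same field values as the third label in the example JSON.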
{
  "cells": [
    {
      "content": "SHIHLIN TAIWAN",
      "index": 0,
      "line": 0,
      "metadata": [],
      "tokens": [
        "SHIHLIN",
        "TAIWAN"
      ]
    },
    {
      "content": "STREET SNACKS",
      "index": 0,
      "line": 1,
      "metadata": [],
      "tokens": [
        "STREET",
        "SNACKS"
      ]
    }
  ],
  "labelSets": [],
  "labels": [
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 0,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 130,
      "y0": 154,
      "x1": 255,
      "y1": 154,
      "x2": 255,
      "y2": 186,
      "x3": 130,
      "y3": 186
    },
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 1,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 261,
      "y0": 154,
      "x1": 375,
      "y1": 154,
      "x2": 375,
      "y2": 186,
      "x3": 261,
      "y3": 186
    }
  ],
  "name": "receipt.jpg",
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ],
  "type": "BOUNDING_BOX"
}
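The corner ordering in the bounding box labels above (top left, top right, bottom right, bottom left) can be generated from an axis-aligned rectangle. This is a minimal sketch; the helper names `rect_to_corners` and `fits_on_page` are hypothetical, and the bounds check assumes the coordinates are pixels measured from the page's top left corner on the same scale as pageWidth and pageHeight, which the format description does not state explicitly.

```python
def rect_to_corners(left, top, right, bottom):
    """Return the nodeCount and x0..y3 fields for an axis-aligned rectangle."""
    return {
        "nodeCount": 4,
        "x0": left, "y0": top,       # top left
        "x1": right, "y1": top,      # top right
        "x2": right, "y2": bottom,   # bottom right
        "x3": left, "y3": bottom,    # bottom left
    }


def fits_on_page(corners, page):
    """Check that every corner lies within the page (assumed pixel coordinates)."""
    count = corners["nodeCount"]
    xs = [corners[f"x{i}"] for i in range(count)]
    ys = [corners[f"y{i}"] for i in range(count)]
    return (0 <= min(xs) and max(xs) <= page["pageWidth"]
            and 0 <= min(ys) and max(ys) <= page["pageHeight"])


page = {"pageIndex": 0, "pageHeight": 619, "pageWidth": 551}
shihlin_box = rect_to_corners(left=130, top=154, right=255, bottom=186)
```

Merging `shihlin_box` with the position fields, `pageIndex`, and `"type": "BOUNDING_BOX"` gives the first label in the example.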