Importable Format

Importable format is a JSON format used to import data into a Datasaur project.

An Importable JSON file may contain the following data structures:

  1. type: the Importable type (identified by the value "BOUNDING_BOX")

  2. cells: an array of cells, where each cell is the intersection of a row and a column

    1. content: a sentence in the cell

    2. index: the column index for the cell

    3. line: the row index for the cell

    4. metadata: additional information for a cell

    5. tokens: array of strings to define custom tokenization

  3. labelSets: the label sets used by the project

  4. labels: an array of labels

    1. Bounding box label type

      1. type: identified by the value "BOUNDING_BOX"

      2. startCellLine: row (line) index of the cell where the label starts

      3. startCellIndex: column index of the cell where the label starts

      4. startTokenIndex: starting token index position

      5. startCharIndex: starting character index position (relative to the token at startTokenIndex; counting restarts from 0 for each token)

      6. endCellLine: row (line) index of the cell where the label ends

      7. endCellIndex: column index of the cell where the label ends

      8. endTokenIndex: ending token index position

      9. endCharIndex: ending character index position (relative to the token at endTokenIndex)

      10. layer: the layer where the token is positioned

      11. counter:

      12. pageIndex: index of the page if the document contains multiple pages

      13. nodeCount: total number of points in the bounding box

      14. x0: x coordinate of top left position of the bounding box

      15. y0: y coordinate of top left position of the bounding box

      16. x1: x coordinate of top right position of the bounding box

      17. y1: y coordinate of top right position of the bounding box

      18. x2: x coordinate of bottom right position of the bounding box

      19. y2: y coordinate of bottom right position of the bounding box

      20. x3: x coordinate of bottom left position of the bounding box

      21. y3: y coordinate of bottom left position of the bounding box

  5. pages: an array of page information

    1. pageIndex: index of the page if the document contains multiple pages

    2. pageHeight: original page height in pixels

    3. pageWidth: original page width in pixels
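
The corner ordering in the label fields above (x0/y0 top left, then top right, bottom right, bottom left) can be sketched in Python as follows. This is only an illustration: `make_bounding_box_label` is a hypothetical helper, not part of any Datasaur library. It assumes an axis-aligned rectangle, a label covering exactly one token, an inclusive end character index, and `layer` and `counter` values of 0 copied from the example on this page.

```python
import json


def make_bounding_box_label(line, index, token_index, token,
                            left, top, right, bottom, page_index=0):
    """Build one label dict for a single token from an axis-aligned rectangle.

    Hypothetical helper for illustration only.
    """
    return {
        "type": "BOUNDING_BOX",
        # The label starts and ends in the same cell and the same token.
        "startCellLine": line,
        "startCellIndex": index,
        "startTokenIndex": token_index,
        "startCharIndex": 0,
        "endCellLine": line,
        "endCellIndex": index,
        "endTokenIndex": token_index,
        # Assumption: the end character index is inclusive.
        "endCharIndex": len(token) - 1,
        # Assumption: 0 is used for both fields, as in the example on this page.
        "layer": 0,
        "counter": 0,
        "pageIndex": page_index,
        "nodeCount": 4,
        # Corners run clockwise, starting from the top left.
        "x0": left, "y0": top,       # top left
        "x1": right, "y1": top,      # top right
        "x2": right, "y2": bottom,   # bottom right
        "x3": left, "y3": bottom,    # bottom left
    }


label = make_bounding_box_label(0, 0, 0, "SHIHLIN", 130, 154, 255, 186)
print(json.dumps(label, indent=2))
```

Building labels through one function like this keeps the eight coordinate fields consistent with each other, which is easy to get wrong when they are written by hand.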

Example (with bounding box label type)

{
  "cells": [
    {
      "content": "SHIHLIN TAIWAN",
      "index": 0,
      "line": 0,
      "metadata": [],
      "tokens": [
        "SHIHLIN",
        "TAIWAN"
      ]
    },
    {
      "content": "STREET SNACKS",
      "index": 0,
      "line": 1,
      "metadata": [],
      "tokens": [
        "STREET",
        "SNACKS"
      ]
    }
  ],
  "labelSets": [],
  "labels": [
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 0,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 130,
      "y0": 154,
      "x1": 255,
      "y1": 154,
      "x2": 255,
      "y2": 186,
      "x3": 130,
      "y3": 186
    },
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 1,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 261,
      "y0": 154,
      "x1": 375,
      "y1": 154,
      "x2": 375,
      "y2": 186,
      "x3": 261,
      "y3": 186
    }
  ],
  "name": "receipt.jpg",
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ],
  "type": "BOUNDING_BOX"
}
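
To show how the position fields tie a label back to its cell and tokens, here is a minimal Python sketch that resolves the text covered by each label in a trimmed copy of the example above. The helper names (`find_cell`, `label_text`, `box_within_page`) are hypothetical. The sketch assumes that a label stays inside one cell, that endCharIndex is inclusive (as the values 6 for "SHIHLIN" and 5 for "TAIWAN" suggest), and that box coordinates share the pixel space of pageWidth and pageHeight.

```python
doc = {
    "cells": [
        {"content": "SHIHLIN TAIWAN", "index": 0, "line": 0,
         "metadata": [], "tokens": ["SHIHLIN", "TAIWAN"]},
    ],
    "labels": [
        {"startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 0,
         "startCharIndex": 0, "endCellLine": 0, "endCellIndex": 0,
         "endTokenIndex": 0, "endCharIndex": 6, "pageIndex": 0,
         "type": "BOUNDING_BOX", "nodeCount": 4,
         "x0": 130, "y0": 154, "x1": 255, "y1": 154,
         "x2": 255, "y2": 186, "x3": 130, "y3": 186},
        {"startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1,
         "startCharIndex": 0, "endCellLine": 0, "endCellIndex": 0,
         "endTokenIndex": 1, "endCharIndex": 5, "pageIndex": 0,
         "type": "BOUNDING_BOX", "nodeCount": 4,
         "x0": 261, "y0": 154, "x1": 375, "y1": 154,
         "x2": 375, "y2": 186, "x3": 261, "y3": 186},
    ],
    "pages": [{"pageIndex": 0, "pageHeight": 619, "pageWidth": 551}],
    "type": "BOUNDING_BOX",
}


def find_cell(doc, line, index):
    """Return the cell at the given row (line) and column (index)."""
    for cell in doc["cells"]:
        if cell["line"] == line and cell["index"] == index:
            return cell
    raise KeyError(f"no cell at line={line}, index={index}")


def label_text(doc, label):
    """Return the text a label covers, assuming it stays inside one cell."""
    start = (label["startCellLine"], label["startCellIndex"])
    end = (label["endCellLine"], label["endCellIndex"])
    if start != end:
        raise ValueError("multi-cell labels are not handled in this sketch")
    cell = find_cell(doc, *start)
    first, last = label["startTokenIndex"], label["endTokenIndex"]
    tokens = list(cell["tokens"][first:last + 1])
    # Trim the end before the start so a single-token label slices correctly.
    tokens[-1] = tokens[-1][: label["endCharIndex"] + 1]  # inclusive end (assumed)
    tokens[0] = tokens[0][label["startCharIndex"]:]
    return " ".join(tokens)


def box_within_page(doc, label):
    """Check that every corner lies inside the page the label refers to."""
    page = next(p for p in doc["pages"] if p["pageIndex"] == label["pageIndex"])
    corners = [(label[f"x{i}"], label[f"y{i}"]) for i in range(label["nodeCount"])]
    return all(0 <= x <= page["pageWidth"] and 0 <= y <= page["pageHeight"]
               for x, y in corners)


for label in doc["labels"]:
    print(label_text(doc, label), box_within_page(doc, label))
```

A check like this can be run before import to catch labels whose token positions or coordinates do not match the cells and pages they point to.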