Importable Format

Importable format is a JSON format used to import data into a Datasaur project.

An Importable JSON file may contain the following data structures:

type: the Importable type (identified by the value "BOUNDING_BOX")
cells: an array of cells, where each cell is the intersection of a row and a column
1. content: a sentence in the cell
2. index: the column index for the cell
3. line: the row index for the cell
4. metadata: additional information for a cell
5. tokens: array of strings to define custom tokenization
labelSets: the label set which is used by the project.
labels: an array of labels
1. Span label type
  1. id: A unique number
  2. type: Identified with value "SPAN"
  3. startCellLine, startCellIndex, startTokenIndex, startCharIndex. The starting position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
  4. endCellLine, endCellIndex, endTokenIndex, endCharIndex. The ending position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
  5. layer: The label set index to which this label belongs.
  6. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
  7. labelSetItemId: The ID of this label within the label set.
2. Arrow label type
  1. id: A unique number
  2. type: Identified with value "ARROW"
  3. originId: The id of the label from which this arrow starts.
  4. destinationId: The id of the label at which this arrow ends.
  5. startCellLine, startCellIndex, startTokenIndex, startCharIndex. This is the same as the origin label's.
  6. endCellLine, endCellIndex, endTokenIndex, endCharIndex. This is the same as the destination label's.
  7. layer: The label set index to which this label belongs.
  8. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
  9. labelSetItemId: The ID of this label within the label set. Use an empty string if this arrow does not have a label.
3. Bounding box label type
  1. type: identified with value "BOUNDING_BOX"
  2. startCellLine: line (row) of the cell where the label starts
  3. startCellIndex: column index of the cell where the label starts
  4. startTokenIndex: index of the token where the label starts
  5. startCharIndex: index of the character where the label starts (relative to the token; it restarts from 0 for each new token)
  6. endCellLine: line (row) of the cell where the label ends
  7. endCellIndex: column index of the cell where the label ends
  8. endTokenIndex: index of the token where the label ends
  9. endCharIndex: index of the character where the label ends (relative to the token)
  10. layer: the layer where the token is positioned
  11. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
  12. pageIndex: index of the page if the document contains multiple pages
  13. nodeCount: total number of the bounding box points
  14. x0: x coordinate of top left position of the bounding box
  15. y0: y coordinate of top left position of the bounding box
  16. x1: x coordinate of top right position of the bounding box
  17. y1: y coordinate of top right position of the bounding box
  18. x2: x coordinate of bottom right position of the bounding box
  19. y2: y coordinate of bottom right position of the bounding box
  20. x3: x coordinate of bottom left position of the bounding box
  21. y3: y coordinate of bottom left position of the bounding box
pages: an array of page information
1. pageIndex: index of the page if the document contains multiple pages
2. pageHeight: original page height in pixels
3. pageWidth: original page width in pixels
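
As a quick way to see how these structures fit together, the following minimal sketch parses an Importable JSON string and counts its labels by type. The helper name count_labels_by_type and the tiny sample document are illustrative assumptions, not part of Datasaur's tooling.

```python
import json
from collections import Counter


def count_labels_by_type(importable: dict) -> dict:
    """Count the entries of "labels" by their "type" value (hypothetical helper)."""
    return dict(Counter(label["type"] for label in importable.get("labels", [])))


# A deliberately tiny document; only the keys needed for the count are present.
sample = json.loads("""
{
  "cells": [],
  "labelSets": [],
  "labels": [
    {"id": 1, "type": "SPAN"},
    {"id": 2, "type": "SPAN"},
    {"id": 3, "type": "ARROW"}
  ]
}
""")

print(count_labels_by_type(sample))  # {'SPAN': 2, 'ARROW': 1}
```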

Example (with span and arrow label type)

{
  "cells": [
    {
      "content": "The quick brown fox jumps over the lazy dog",
      "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
      "line": 0,
      "index": 0,
      "metadata": []
    }
  ],
  "labelSets": [
    {
      "name": "Subjects",
      "index": 0,
      "labelItems": [
        { "id": "FOX_ID", "labelName": "Fox" },
        { "id": "DOG_ID", "labelName": "Dog" }
      ]
    },
    {
      "name": "Verbs",
      "index": 1,
      "labelItems": [
        { "id": "JUMP_ID", "labelName": "Jump" }
      ]
    }
  ],
  "labels": [
    {
      "id": 1,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 3,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "FOX_ID"
    },
    {
      "id": 2,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 6,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "DOG_ID"
    },
    {
      "id": 3,
      "originId": 1,
      "destinationId": 2,
      "type": "ARROW",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 1,
      "counter": 0,
      "labelSetItemId": "JUMP_ID"
    }
  ],
  "name": "Example"
}

The above JSON will produce the following output:

Explanation

The above project contains 1 cell and 3 labels (2 span labels and 1 arrow label) from 2 different label sets.

Defining the Cells

Refer to this for a more thorough explanation.

There is only one Cell in the project, and for that Cell we set:

content as The quick brown fox jumps over the lazy dog.
tokens as ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"], because we use simple whitespace tokenization. You can define your own tokenization method here.
line as 0, since the cell is on the first row.
index as 0, since the cell is in the leftmost column.
metadata as [], since we don't display any metadata.
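
The cell definition above can be sketched in Python as follows. This is only an illustration: make_cell is a hypothetical helper, and it assumes whitespace tokenization, which is what the example uses.

```python
def make_cell(content: str, line: int, index: int) -> dict:
    """Build one cell dict; tokens come from splitting on whitespace (an assumption)."""
    return {
        "content": content,
        "tokens": content.split(),  # swap in your own tokenizer here
        "line": line,               # row index of the cell
        "index": index,             # column index of the cell
        "metadata": [],             # no metadata is displayed in this example
    }


cell = make_cell("The quick brown fox jumps over the lazy dog", line=0, index=0)
print(cell["tokens"])
```

If you use a different tokenizer, only the tokens value changes; the token indices used by the labels then refer to positions in that list.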

Defining the Label Sets

There are 2 label sets, named Subjects (index 0) and Verbs (index 1). Please refer to this for a thorough explanation of the Label Set Index. Label Items within each Label Set must have a unique id; for example, the Subjects Label Set has FOX_ID and DOG_ID.
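
Since item ids must be unique within a label set, it can help to check this before importing. The sketch below is one possible check; duplicate_item_ids is a hypothetical helper name, and the label set is copied from the example above.

```python
from collections import Counter


def duplicate_item_ids(label_set: dict) -> list:
    """Return the ids that occur more than once in one label set (hypothetical helper)."""
    counts = Counter(item["id"] for item in label_set["labelItems"])
    return sorted(item_id for item_id, n in counts.items() if n > 1)


subjects = {
    "name": "Subjects",
    "index": 0,
    "labelItems": [
        {"id": "FOX_ID", "labelName": "Fox"},
        {"id": "DOG_ID", "labelName": "Dog"},
    ],
}

print(duplicate_item_ids(subjects))  # []
```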

Defining the Labels

The first label (Fox):

starts from the first character of the second token in the first cell, hence we set:
- startCellLine as 0, since the first cell's Line is 0.
- startCellIndex as 0, since the first cell's Index is 0.
- startTokenIndex as 1, since the label starts from the second token.
- startCharIndex as 0, since the label starts from the first character of the token.
ends at the third character of the fourth token in the first cell, hence we set:
- endCellLine as 0
- endCellIndex as 0
- endTokenIndex as 3
- endCharIndex as 2
comes from the first Label Set, hence we set:
- layer as 0
- labelSetItemId as FOX_ID

The JSON for the second label (Dog) follows the same idea as the first label (Fox).
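
To make the positions concrete, the following sketch recovers the text covered by a span label. It rests on assumptions drawn from the example rather than from a formal specification: the span stays inside a single cell, end indices are inclusive, and tokens are rejoined with a single space. span_text is a hypothetical helper.

```python
def span_text(cell: dict, label: dict) -> str:
    """Return the text a span label covers within one cell (inclusive end indices assumed)."""
    tokens = cell["tokens"]
    start_t, end_t = label["startTokenIndex"], label["endTokenIndex"]
    start_c, end_c = label["startCharIndex"], label["endCharIndex"]
    if start_t == end_t:
        # The span starts and ends inside the same token.
        return tokens[start_t][start_c:end_c + 1]
    first = tokens[start_t][start_c:]      # from the start character to the token's end
    middle = tokens[start_t + 1:end_t]     # whole tokens between the two ends
    last = tokens[end_t][:end_c + 1]       # up to and including the end character
    return " ".join([first, *middle, last])


cell = {"tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]}
fox = {"startTokenIndex": 1, "startCharIndex": 0, "endTokenIndex": 3, "endCharIndex": 2}
dog = {"startTokenIndex": 6, "startCharIndex": 0, "endTokenIndex": 8, "endCharIndex": 2}

print(span_text(cell, fox))  # quick brown fox
print(span_text(cell, dog))  # the lazy dog
```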

The third label (Jump):

It originates from the first label (Fox) and ends at the second label (Dog), hence we set:
- originId as 1, because the first label has id 1.
- destinationId as 2, because the second label has id 2.
startCellLine, startCellIndex, startTokenIndex, startCharIndex are the same as the first label's (Fox).
endCellLine, endCellIndex, endTokenIndex, endCharIndex are the same as the second label's (Dog).
comes from the second Label Set, hence we set:
- layer as 1
- labelSetItemId as JUMP_ID
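
Because an arrow copies its start position from the origin label and its end position from the destination label, it can be derived from the two span labels. The sketch below shows one way to do that; make_arrow is a hypothetical helper, counter is fixed at 0 on the assumption that no other label sits at the same position, and layer 1 matches the Verbs label set index in the example JSON.

```python
START_KEYS = ["startCellLine", "startCellIndex", "startTokenIndex", "startCharIndex"]
END_KEYS = ["endCellLine", "endCellIndex", "endTokenIndex", "endCharIndex"]


def make_arrow(arrow_id, origin, destination, layer, label_set_item_id=""):
    """Build an ARROW label from two existing labels (hypothetical helper)."""
    arrow = {
        "id": arrow_id,
        "type": "ARROW",
        "originId": origin["id"],
        "destinationId": destination["id"],
    }
    arrow.update({key: origin[key] for key in START_KEYS})     # same start as the origin label
    arrow.update({key: destination[key] for key in END_KEYS})  # same end as the destination label
    arrow.update({"layer": layer, "counter": 0, "labelSetItemId": label_set_item_id})
    return arrow


fox = {"id": 1, "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2}
dog = {"id": 2, "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 6, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 8, "endCharIndex": 2}

jump = make_arrow(3, fox, dog, layer=1, label_set_item_id="JUMP_ID")
print(jump["startTokenIndex"], jump["endTokenIndex"])  # 1 8
```

Leaving label_set_item_id at its default produces an empty string, which is the value described above for an arrow without a label.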

Example (with bounding box label type)

{
  "cells": [
    {
      "content": "SHIHLIN TAIWAN",
      "index": 0,
      "line": 0,
      "metadata": [],
      "tokens": [
        "SHIHLIN",
        "TAIWAN"
      ]
    },
    {
      "content": "STREET SNACKS",
      "index": 0,
      "line": 1,
      "metadata": [],
      "tokens": [
        "STREET",
        "SNACKS"
      ]
    }
  ],
  "labelSets": [],
  "labels": [
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 0,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 130,
      "y0": 154,
      "x1": 255,
      "y1": 154,
      "x2": 255,
      "y2": 186,
      "x3": 130,
      "y3": 186
    },
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 1,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 261,
      "y0": 154,
      "x1": 375,
      "y1": 154,
      "x2": 375,
      "y2": 186,
      "x3": 261,
      "y3": 186
    }
  ],
  "name": "receipt.jpg",
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ],
  "type": "BOUNDING_BOX"
}
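
In the example above, each box is an axis-aligned rectangle whose corners are listed clockwise from the top left. For that case, the eight coordinates can be derived from the rectangle's edges, as in the minimal sketch below. box_corners is a hypothetical helper, and it assumes the image origin is at the top left with y increasing downward, as the example values suggest.

```python
def box_corners(left: int, top: int, right: int, bottom: int) -> dict:
    """Return nodeCount and x0..y3 for an axis-aligned rectangle (hypothetical helper)."""
    return {
        "nodeCount": 4,
        "x0": left, "y0": top,       # top left
        "x1": right, "y1": top,      # top right
        "x2": right, "y2": bottom,   # bottom right
        "x3": left, "y3": bottom,    # bottom left
    }


# Reproduces the coordinates of the first label ("SHIHLIN") in the example.
shihlin_box = box_corners(left=130, top=154, right=255, bottom=186)
print(shihlin_box["x2"], shihlin_box["y2"])  # 255 186
```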
