Creating a Project
After signing in, you will be automatically directed to your personal workspace. To switch to your team workspace, select your avatar in the top right, choose “Switch Workspace”, and then select your team workspace. You will be brought to the Projects page of that workspace. This page allows the Admin of the team to create projects. It also shows the Project shortcuts and the list of Projects that you are working on.
Creating a Project can be done by clicking on the Create project button: this enables you to create any type of project. You can also create a project by selecting one of the Project Template shortcuts: these selections contain pre-selected settings for that specific use-case. In this article, we will walk through creating a new project.
Ready to make our first project? We are going to create a Token-Based project together (span-based labeling). This type of project supports use cases like Named Entity Recognition, Parts of Speech, and more. Any workflow that requires labeling specific words and/or phrases can be done with Datasaur’s Token-Based project. If you would like a tutorial on creating Row-Based (textual classification), Audio, OCR, Bounding Box, or Document-Based projects, please watch their corresponding YouTube videos. Once you have selected “Create Project”, you will find yourself in the Project Creation Wizard.
The Project Creation Wizard is a tool for creating custom Projects. It has five basic steps: Upload, Preview, Labeler Tasks, Assignment, and Project Settings.
Step 1: Upload
At the bottom of this page is a list of the file formats Datasaur natively accepts for each project type. As we are creating a Token-Based project, these are the formats we can upload: .conll, .conllu, .json, .tsv, .txt (.doc, .docx, .ppt, .pptx via OCR). A .txt file does not need to match any particular format, which makes it the easiest to upload, so we’ll use one to begin our project.
Upload your file(s)
Here, we’ve uploaded one .txt file, but you can upload multiple files at once for a single project, as long as all files in a Project share the same file format. Uploading the data can be done in two ways: drag and drop, or browsing files from your hard drive. If you are interested in creating Projects via API, see the API documentation. Note: the maximum file size allowed is 50 MB.
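If you prepare many files at once, it can help to check them against the rules above before uploading. The sketch below is only an illustration, not part of Datasaur: the function name `check_upload_batch` and the `(filename, size_in_bytes)` input shape are hypothetical, the extension list is the Token-Based list from this article (without the OCR formats), and it assumes "50 MB" means 50 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Token-Based formats listed in this article (OCR-only formats left out).
TOKEN_BASED_EXTENSIONS = {".conll", ".conllu", ".json", ".tsv", ".txt"}

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def check_upload_batch(files):
    """Return a list of problems for (filename, size_in_bytes) pairs.

    An empty list means the batch follows the rules described above.
    """
    problems = []

    # All files in one Project must share the same file format.
    extensions = {Path(name).suffix.lower() for name, _ in files}
    if len(extensions) > 1:
        problems.append(f"mixed file formats: {sorted(extensions)}")

    for name, size in files:
        extension = Path(name).suffix.lower()
        if extension not in TOKEN_BASED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {extension!r}")
        if size > MAX_FILE_SIZE_BYTES:
            problems.append(f"{name}: larger than 50 MB")

    return problems


if __name__ == "__main__":
    batch = [("reviews_1.txt", 1_200), ("reviews_2.tsv", 3_400)]
    for problem in check_upload_batch(batch):
        print(problem)
```

Running the script reports the mixed formats in the example batch; a batch of only .txt files under the size limit returns an empty list.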
Step 2: Preview
In Step 2, we decide two options for our data: the line separator and the tokenizer.
Line Separator
The Line Separator decides how the rows in the labeling interface are split. You have two native options: New Line and Dot (.). New Line will create a new row for every line in your original data. Dot (.) will create a new row after each “.” in your data.
Tokenizer
Datasaur offers two options for your tokenizer: whitespace and wink.
Whitespace splits the text only at spaces, so punctuation stays attached to the neighboring word, as seen in the picture below.
Wink separates certain punctuation marks from the neighboring word, as seen below.
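To get a feel for what the line separator and tokenizer choices do to your data, here is a rough sketch in Python. It is not Datasaur's implementation: the function names are hypothetical, and `punctuation_aware_tokenize` is only a simple regular-expression stand-in for wink. It splits off every punctuation mark, whereas wink separates only certain ones.

```python
import re


def split_rows(text, separator="new_line"):
    """Split raw text into rows, loosely mimicking the Line Separator options."""
    if separator == "new_line":
        parts = text.splitlines()
    elif separator == "dot":
        # Each row runs up to and including the next "."
        parts = re.findall(r"[^.]+\.?", text)
    else:
        raise ValueError(f"unknown separator: {separator}")
    # Drop surrounding whitespace and empty rows.
    return [part.strip() for part in parts if part.strip()]


def whitespace_tokenize(row):
    """Split only at whitespace, so punctuation stays attached to words."""
    return row.split()


def punctuation_aware_tokenize(row):
    """Rough stand-in for wink: each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", row)


if __name__ == "__main__":
    text = "Hello, world! This is Datasaur.\nHappy labeling."
    print(split_rows(text, "new_line"))
    print(split_rows(text, "dot"))
    print(whitespace_tokenize("Hello, world!"))
    print(punctuation_aware_tokenize("Hello, world!"))
```

With the whitespace approach, "Hello," is a single token; with the punctuation-aware approach, "Hello" and "," can be labeled separately.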
You must also choose whether you want to label individual tokens or answer questions about the text.
There are three task types that you can choose: span-based, row-based, and document-based.
Step 3: Labeler Tasks
In the third step, we create the labels that make up our labelset. There are three different ways to upload or create labels.
Create Labels in the UI
Select “Create your Own” to begin manually typing in your labels (as seen in the .gif below). You can also manually select the color for each of your labels.
Upload Labels from a CSV
Select the white space to upload a CSV of your labels from your local drive. The formatting of the CSV is very simple: your first label goes in cell A1, and your subsequent labels go down column A (A2, A3, A4, A5, etc.).
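If your labels already live in a script or spreadsheet export, a single-column CSV like the one described above can be generated with a few lines of Python. This is a minimal sketch: the function name `labels_to_csv`, the file name, and the example labels are placeholders, not anything Datasaur requires.

```python
import csv
import io


def labels_to_csv(labels):
    """Return CSV text with one label per row, all in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for label in labels:
        # One label per row: A1, A2, A3, ...
        writer.writerow([label])
    return buffer.getvalue()


if __name__ == "__main__":
    csv_text = labels_to_csv(["PERSON", "ORGANIZATION", "LOCATION"])
    # newline="" keeps the csv module's own line endings intact.
    with open("labelset.csv", "w", newline="", encoding="utf-8") as handle:
        handle.write(csv_text)
```

Using the `csv` module rather than joining strings by hand means labels that contain commas or quotes are escaped correctly.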
Upload Labels from your Team’s Saved Library
In your team workspace, we have a page called Labelset Management. This page allows you to create, edit, and delete labelsets, so your team can save all of its labelsets in one place. Utilizing this method means you do not have to re-upload or re-create a labelset each time you create a new project.
Span Label Settings
On the bottom of the page, you’ll see a section called “Span Labeling.” We have four different options in this section.
Limit selection to a maximum of 1 token: You need every token in the document labeled.
Tokens and token spans should have at most one label: multiple labels on a single token or token span will not be allowed.
Allow arrows to be drawn between labels: allows you to draw arrows from one label to another to annotate relationships between words. For example, this is useful for showing that an adjective is related to a noun, or a pronoun is referring to a person.
Default text: select whether labelers will perform a token or character selection for their labels. For example, will labelers be applying labels to whole words at once (token), or will they be able to label the individual letters within a word (character)? Note: some languages may require you to change the selection to character selection, e.g. Mandarin, Korean, or Thai.
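To illustrate the "at most one label" option above, here is a small sketch that finds spans carrying more than one label in a set of annotations. The data shape, tuples of `(start_token, end_token, label)`, and the function name are hypothetical and not Datasaur's export format. The sketch only compares identical spans; how partially overlapping spans are treated is not covered in this article, so it is left out.

```python
from collections import defaultdict


def find_multi_labeled_spans(annotations):
    """Return spans that carry more than one distinct label.

    annotations: iterable of (start_token, end_token, label) tuples.
    """
    labels_by_span = defaultdict(set)
    for start, end, label in annotations:
        labels_by_span[(start, end)].add(label)

    # Keep only spans where two or more different labels were applied.
    return {
        span: sorted(labels)
        for span, labels in labels_by_span.items()
        if len(labels) > 1
    }


if __name__ == "__main__":
    annotations = [
        (0, 1, "PERSON"),
        (0, 1, "ORGANIZATION"),  # second label on the same span
        (3, 3, "LOCATION"),
    ]
    print(find_multi_labeled_spans(annotations))
```

With the option enabled, a situation like the first two annotations in the example would not be allowed in the labeling interface.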
Step 4: Assignment
In this step, we choose who is going to be a labeler and who is going to be a reviewer. When assigning personnel, you will have three choices: Labeler, Labeler & Reviewer, and Reviewer.
Admins will only have two available choices: Labeler & Reviewer and Reviewer. Admins will always have access to Reviewer Mode for any project.
Here we set how many labelers have to agree on a label for that label to be automatically accepted by Datasaur. The Peer Review Consensus slider allows you to determine the threshold at which labels will be automatically accepted. For highly sensitive projects where there is no room for error, you may want to require unanimity from all assigned labelers. For less sensitive projects where efficiency and cost are more important than accuracy, a majority vote may be sufficient. Any label where the threshold is not met will need to be manually reviewed by you, the project creator / reviewer.
If you check No Consensus, all of your labelers’ labels will be treated as conflicting labels.
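The consensus rule can be pictured with a short sketch. This is only an illustration of the idea described above, not Datasaur's actual logic: the function name and input shape are hypothetical, `threshold=None` stands in for the No Consensus checkbox, and treating a span as a conflict when more than one label reaches the threshold is an assumption of this sketch.

```python
from collections import Counter


def resolve_by_consensus(votes, threshold):
    """Return the auto-accepted label for one span, or None if it needs review.

    votes: the label each labeler applied to the same span.
    threshold: how many labelers must agree; None models "No Consensus".
    """
    if threshold is None:
        # No Consensus: every label is treated as a conflict.
        return None

    counts = Counter(votes)
    winners = [label for label, count in counts.items() if count >= threshold]

    # Assumption: accept only when exactly one label reaches the threshold.
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    votes = ["PERSON", "PERSON", "ORGANIZATION"]
    print(resolve_by_consensus(votes, threshold=2))  # majority is enough
    print(resolve_by_consensus(votes, threshold=3))  # unanimity required
```

In the example, a threshold of 2 accepts PERSON automatically, while a threshold of 3 leaves the span for the reviewer to resolve.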
Step 5: Project Settings
In this step, we choose some final, advanced admin settings for the project. Please keep in mind that most of these choices are intended for advanced requirements.
Label set modification allows your labelers to edit and add labels to the project within the labeling interface.
Text Modification enables your labelers to edit the text of the dataset.
Mask Personally Identifiable Information allows Admins to decide whether sensitive information should be covered by asterisks or random characters. It also allows Admins to decide what type of information is masked (for example: addresses, social security numbers, company names, etc.).
Confirm unapplied label classes means you will be presented with all of the labels that were not applied in the project.
Show labeler names in Review Mode controls whether labeler names are visible to reviewers. If you would like to reduce the chance of bias, you can choose not to show their names in Reviewer Mode.
Show rejected labels allows reviewers to see all labels that they have rejected.
Show labels from inactive label set: if your project has multiple label sets, this enables the reviewer to see the labels from every labelset all at once.
Show original sentence means the reviewer will see the original sentences alongside any edits the labelers have made.
Set notification: by default, Reviewers will be notified that the project is ready when all of the labelers have marked their work as complete. This slider allows you to manually set the number of completions that will trigger an email notification.
At this point, we have finished setting up our project and we can select "Launch project" in the bottom right corner. Happy Labeling!