In computer programming, whitespace is any character or series of characters that represents horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but it typically does occupy an area on a page. For example, the common whitespace character U+0020 SPACE (ASCII 32) represents a blank space in text and is used as a word divider in Western scripts.
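To make the definition concrete, here is a minimal Python sketch that inspects a few whitespace characters. The sample list is chosen only for illustration and is not exhaustive.

```python
import unicodedata

# A few characters that Python's str.isspace() treats as whitespace.
samples = [" ", "\t", "\n", "\u00a0"]

for ch in samples:
    # Control characters such as tab have no Unicode name, so supply a fallback.
    name = unicodedata.name(ch, "<control character>")
    print(f"U+{ord(ch):04X} {name} isspace={ch.isspace()}")

# U+0020 SPACE has code point 32 and acts as a word divider when splitting text.
words = "separate text into tokens".split(" ")
print(words)
```

None of these characters draws a visible mark, yet each one separates or positions the text around it.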
By default, Datasaur uses wink-tokenizer to split text into individual tokens, which are then grouped into lines of sentences.
"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)
How does it work? wink-tokenizer matches the text against a set of regular expressions (regex). The full list of regexes is documented at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it recognizes the following token types by default (the letter in parentheses is the code that represents the token type in wink-tokenizer's fingerprint):
currency symbols such as $ or £ (r)
any standard Unicode emoji, e.g. 😊 or 😂 or 🎉 (j)
common emoticons such as :-) or :D (c)
hashtags such as #happy or #followme (h)
any integer, decimal number, or fraction such as 19, 2.718 or 1/4, and numerals containing ", - / .", for example 12-12-1924 (n)
ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (o)
common punctuation such as ? or , (token becomes fingerprint)
any "quoted text" in the sentence (q). Note: this one is disabled by default.
symbols, for example ~ or + or & or % or / (token becomes fingerprint)
common representations of time such as 4pm or 16:00 hours (t)
@mentions as on GitHub or Twitter (m)
words such as faster or résumé or prévenir (w)
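The general idea of regex-based tokenization can be sketched in a few lines of Python. This is an illustration only: the patterns, the tag names, and the `tokenize` helper below are hypothetical simplifications, not the regexes or API of wink-tokenizer, and they cover only some of the token types listed above.

```python
import re

# Hypothetical, simplified patterns; NOT the regexes wink-tokenizer uses.
# Order matters: when several patterns could match at the same position,
# the earliest one in this list wins.
PATTERNS = [
    ("time", r"\d{1,2}:\d{2}|\d{1,2}(?:am|pm)\b"),
    ("ordinal", r"\d+(?:st|nd|rd|th)\b"),
    ("number", r"\d+(?:[.,/-]\d+)*"),
    ("hashtag", r"#\w+"),
    ("mention", r"@\w+"),
    ("currency", r"[$£]"),
    ("emoticon", r"[:;]-?[)(DP]"),
    ("word", r"[^\W\d_]+"),
    ("punctuation", r"[.,;:!?()]"),
    ("symbol", r"[~+&%/*=]"),
]

# Combine everything into one alternation of named groups.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{tag}>{pattern})" for tag, pattern in PATTERNS),
    re.IGNORECASE,
)


def tokenize(text):
    """Return a list of {"value", "tag"} dicts; unmatched characters
    (such as whitespace) are skipped."""
    return [
        {"value": match.group(), "tag": match.lastgroup}
        for match in TOKEN_RE.finditer(text)
    ]


for token in tokenize("@ana paid $19 at 4pm :-) #happy"):
    print(token["tag"], token["value"])
```

Placing the more specific patterns (time, ordinal) before the general number pattern is what keeps 4pm and 91st from being split into a number followed by a word.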
For the exact regex patterns, see the source code at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js