.../data/tokenizer

This directory contains the files that control the operation of the tokenizer. The tokenizer is used to identify the individual tokens in a block of content. Generally a token is equivalent to a single word, but punctuation, contractions, symbols, etc. are also examples of tokens that occur within text.

With the exception of tokenizer.dat, the following files may be customized by users in the tokenizer section of a user directory. Each file is described in more detail below.

breaks.dat                 A list of words containing periods (common abbreviations)
complexstems.dat           A list of words that are commonly expanded in social media content to emphasize sentiment
sentencepunctuation.dat    List of common characters used as terminating sentence punctuation
subwords.dat               A list of words that can be found concatenated in social media content, particularly in hashtags
suffixbreaks.dat           A list of contractions recognized by the tokenizer
tokenizer.dat              A tokenizer-specific data file

breaks.dat

These are words that end in an end-of-sentence marker, such as a period, but should not be treated as ending a sentence. For example:

Mr.
U.S.

These entries prevent text like "The U.S.-based company, owned by Mr. Catlin and Mr. Marshall, ..." from being broken up into multiple sentences.
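The effect of such a list can be illustrated with a minimal sketch. This is not Salience's implementation: the BREAKS set is a hypothetical in-memory stand-in for the file's entries, and split_sentences is a deliberately naive splitter written only to show how an abbreviation list suppresses sentence breaks.

```python
import re

# Hypothetical stand-in for the entries of breaks.dat (assumption, not the real file).
BREAKS = {"Mr.", "U.S."}

def split_sentences(text, breaks=BREAKS):
    """Naive splitter: end a sentence at '.', '!' or '?' followed by whitespace,
    unless the token that ends there is listed in `breaks`."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:end].split()[-1]
        if token in breaks:
            continue  # listed abbreviation: do not break here
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With the list in place, "He moved to the U.S. in May. Mr. Catlin agreed." yields two sentences. With an empty list, the same text is also broken after "U.S." and "Mr.".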

complexstems.dat

These are words that are commonly expanded (stretched by repeating letters) in social media content to add emphasis to sentiment. This file allows a sentiment multiplier to be applied when the tokenizer encounters an expansion of one of these words.

The format of the file is:

<word-that-could-be-expanded> <tab> <sentiment-multiplier>

Example:

love    1.2

This would apply a multiplier of 1.2 to sentiment detected for the word "love" in occurrences such as "I loooove Salience!".
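As a rough sketch of how a file in this format could be read and matched, consider the following. The function names and the letter-repetition rule are assumptions made for illustration. How Salience actually detects expansions is not described here.

```python
import re

# Hypothetical contents in the documented format: <word> <tab> <multiplier>.
SAMPLE = "love\t1.2\n"

def load_complex_stems(text):
    """Parse tab-separated lines into {word: multiplier}."""
    stems = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, multiplier = line.split("\t")
        stems[word.strip()] = float(multiplier)
    return stems

def expansion_multiplier(token, stems):
    """Return the multiplier if `token` is a letter-stretched form of a listed
    word, else 1.0. Assumption: an expansion repeats letters of the word in place."""
    token = token.lower()
    for word, multiplier in stems.items():
        # Each letter of the listed word may occur one or more times.
        pattern = "".join(re.escape(ch) + "+" for ch in word)
        if token != word and re.fullmatch(pattern, token):
            return multiplier
    return 1.0
```

Under these assumptions, expansion_multiplier("loooove", load_complex_stems(SAMPLE)) returns 1.2, while the unexpanded "love" and unrelated words such as "glove" return 1.0.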

sentencepunctuation.dat

This datafile contains common sentence-terminating punctuation. Users can extend it in user/tokenizer if the content being analyzed by Salience uses additional terminating punctuation.

subwords.dat

These are words that are commonly concatenated in social media content, specifically in hashtags. Salience uses this file to deconstruct concatenated phrases in hashtags in order to apply sentiment.
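One way to picture the deconstruction step is dictionary-based segmentation. The sketch below is an assumption about the general technique, not Salience's algorithm. SUBWORDS is a hypothetical stand-in for the file's entries, and split_hashtag is an illustrative name.

```python
# Hypothetical stand-in for the entries of subwords.dat (assumption).
SUBWORDS = {"best", "day", "ever", "love"}

def split_hashtag(hashtag, subwords=SUBWORDS):
    """Split a hashtag into known words using dynamic programming.
    Returns a list of words, or None if no complete split exists."""
    text = hashtag.lstrip("#").lower()
    # best[i] holds one segmentation of text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in subwords:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]
```

With this word list, split_hashtag("#BestDayEver") returns ["best", "day", "ever"], after which sentiment could be assessed on the individual words. A hashtag that cannot be fully covered by listed words returns None.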

suffixbreaks.dat

These are contractions recognized by the tokenizer. Each one is split off as its own token, separate from the root (generally a pronoun), to improve part-of-speech (POS) tagging. Users can extend this datafile in user/tokenizer if additional contractions appear in the content being analyzed by Salience.
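The splitting described above can be sketched as follows. The SUFFIX_BREAKS entries are hypothetical examples, since the actual contents and format of the file are not shown here, and split_contraction is an illustrative helper rather than part of Salience.

```python
# Hypothetical stand-in for the entries of suffixbreaks.dat (assumption).
SUFFIX_BREAKS = ["'ll", "'re", "'ve", "n't", "'s"]

def split_contraction(token, suffixes=SUFFIX_BREAKS):
    """Return [root, suffix] if the token ends in a listed suffix, else [token]."""
    lowered = token.lower()
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if lowered.endswith(suffix) and len(token) > len(suffix):
            return [token[:-len(suffix)], token[-len(suffix):]]
    return [token]
```

For example, split_contraction("They'll") returns ["They", "'ll"], so a POS tagger would see the pronoun and the contracted verb as separate tokens. This sketch only handles the plain ASCII apostrophe.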

tokenizer.dat

This is a datafile specific to the Salience tokenizer and should not be modified or overridden.

Updated 5 months ago
