Salience


Lex Utils currently contains two utility functions that help prepare content for use with Salience: language detection and HTML extraction. Like Salience, Lex Utils is provided as a .so (Linux) or .dll (Windows), with wrappers for C, Java, Python and .NET.

Lex Utils objects are not thread-safe, but they are small, so a multithreaded application can create a separate one on each thread.
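
The one-object-per-thread pattern described above can be sketched with Python's `threading.local`. This is only an illustration: `FakeLanguageUtilities`, `get_utils` and the data path are hypothetical stand-ins, not part of the Lex Utils wrappers. In real code the factory would be whichever constructor or open-session call your wrapper provides.

```python
import threading


class FakeLanguageUtilities:
    """Hypothetical stand-in for a Lex Utils object; not the real wrapper."""

    def __init__(self, data_path):
        self.data_path = data_path


# Thread-local storage: every thread sees its own independent attributes.
_local = threading.local()


def get_utils(data_path, factory=FakeLanguageUtilities):
    """Return the calling thread's own object, creating it on first use."""
    utils = getattr(_local, "utils", None)
    if utils is None:
        utils = factory(data_path)
        _local.utils = utils
    return utils


def run_workers(data_path, count=4):
    """Start `count` threads and collect the object each one created."""
    results = [None] * count

    def worker(index):
        # Each thread only ever touches the object it created itself.
        results[index] = get_utils(data_path)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `get_utils` repeatedly from the same thread returns the same object, so the cost of opening a session is paid once per thread rather than once per document.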

Language Detection

LexLanguageUtilities classifies text into one of the [languages supported by Salience]. Language detection only works for supported languages; content in other languages will not be correctly identified.

First, a LexLanguageUtilities object must be created:

.NET new LexLanguageUtilities(string dataPath)
Java new LexLanguageUtilities(string dataPath)
Python session = openLanguageSession(string dataPath)
C lxaOpenLexLanguageSession(char* dataPath, LexLanguageSession ** ppSession)

To perform language detection on a machine that will not have Salience installed, contact [Lexalytics Support] to obtain a languages.bin file. Any of the constructors above accepts the full path to that file in place of the path to the Salience data directory.

Once a session has been opened, a LanguageRecommendation object can be obtained for any text:

.NET LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)
Java LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)
Python dict getLanguage(LanguageSession session, string text)
C lxaGetLanguage(LexLanguageSession* pSession, char* acText, lxaLanguageRecommendation** pResults)

The results are split into a best match and a list showing how each candidate language scored. Each language is reported as an (internal) code number, a language name, the score for that language, and the optimal score this text could have achieved. Text that scores very low relative to the optimal score may be gibberish or written in an unsupported language. Text that mixes languages produces several results with similar scores, and the ratio of each score to the perfect score approximates that language's share of the text.

.NET

LanguageResult

nLanguageCode   internal code representing this language
sLanguageName   name of the language
fScore          measure of how likely it is that this is the language in question
fPerfectScore   the score to compare fScore against

LanguageRecommendation

Recommendation  a single LanguageResult object representing the most likely language
vAllResults     a vector of LanguageResults for all languages considered

Java

LanguageResult

nLanguageCode   internal code representing this language
sLanguageName   name of the language
fScore          measure of how likely it is that this is the language in question
fPerfectScore   the score to compare fScore against

LanguageRecommendation

Recommendation  a LanguageResult object representing the most likely language
vAllResults     a vector of LanguageResults for all languages considered

Python

best match                   language code for the recommended language
bestMatchName                name of the recommended language
bestMatchScore               score for the recommended language
perfectScore                 score to compare individual scores against
otherLanguages               list of all language results
otherLanguages\language      language code for each other language
otherLanguages\languageName  name of each other language
otherLanguages\score         measure of how well each other language matched
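
As a rough sketch of how the score-to-perfect-score comparison might be applied to the Python result, the helper below computes the ratios and flags weak matches. It assumes each entry in `otherLanguages` is a dict with `languageName` and `score` keys, as the listing above suggests. The function name `summarize_language_result` and the `min_ratio` cutoff of 0.5 are illustrative assumptions, not values documented by Lexalytics, and the sample dict in the usage lines contains made-up numbers.

```python
def summarize_language_result(result, min_ratio=0.5):
    """Compare each score against perfectScore.

    `result` is a dict shaped like the Python output described above.
    `min_ratio` is an arbitrary, illustrative cutoff below which the best
    match is treated as unreliable (possible gibberish or unsupported language).
    """
    perfect = result["perfectScore"]
    if perfect <= 0:
        # Nothing to compare against, so no match can be trusted.
        return {"bestMatchName": result["bestMatchName"],
                "ratio": 0.0, "reliable": False, "shares": {}}

    best_ratio = result["bestMatchScore"] / perfect
    # Ratio of score to perfect score approximates each language's share.
    shares = {entry["languageName"]: entry["score"] / perfect
              for entry in result["otherLanguages"]}
    return {"bestMatchName": result["bestMatchName"],
            "ratio": best_ratio,
            "reliable": best_ratio >= min_ratio,
            "shares": shares}


# Made-up sample values, for illustration only.
sample = {
    "bestMatchName": "English",
    "bestMatchScore": 60.0,
    "perfectScore": 100.0,
    "otherLanguages": [
        {"language": 1, "languageName": "English", "score": 60.0},
        {"language": 2, "languageName": "French", "score": 35.0},
    ],
}
summary = summarize_language_result(sample)
```

A suitable cutoff depends on your content, so it is worth calibrating `min_ratio` against text whose language you already know.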

C

lxaLanguageRecommendation

nRecommended        language code for the recommended language
fRecommendedScore   score for the recommended language
nChoices            number of languages considered, also the length of the arrays
pPossibleLanguages  array of language codes for each language considered
pLanguageScores     array of language scores for each language, corresponding to indices in pPossibleLanguages
fWordCount          total words and bigrams considered, for use in interpreting scores

C provides an additional function to transform a language code into its name:

lxaGetLanguageName(int nLanguageCode, char** acOutText);

HTML Extraction

LexHtmlUtilities removes HTML tags from a document and attempts to strip out unrelated content such as ads and sidebars. Stripping unrelated content will sometimes remove some of the article text, but not significant portions. This is particularly noticeable if you provide non-HTML content: in that case you get the text back with some sentences removed for being 'off topic', so separate HTML from non-HTML content before using the HTML extractor.

First, a LexHtmlUtilities object must be created:

.NET new LexHtmlUtilities(string dataPath)
Java new LexHtmlUtilities(string dataPath)
Python session = openHtmlSession(string dataPath)
C lxaOpenLexHtmlSession(char* dataPath, LexHtmlSession ** ppSession)

Then simply pass in content to the extraction function to get stripped text back:

.NET string LexHtmlUtilities.StripHtml(string htmlText)
Java string LexHtmlUtilities.ExtractTextFromHtml(string htmlText)
Python string extractText(HTMLSession session, string htmlText)
C lxaExtractTextFromHtml(LexHtmlSession* pSession, char* acInText, char** acOutText)
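
Because the extractor can drop sentences from non-HTML input, one possible approach is to screen content before calling it. The sketch below uses a simple tag-counting heuristic. The names `looks_like_html` and `prepare_text`, the tag list and the `min_tags` threshold are all assumptions made for illustration. The `extract` argument stands in for whatever callable wraps your extraction function, so nothing here depends on the actual wrapper's module layout.

```python
import re

# A few common HTML tag names; this list is an illustrative assumption.
_TAG_PATTERN = re.compile(
    r"<\s*/?\s*(html|head|body|div|p|span|a|table|br|h[1-6])\b[^>]*>",
    re.IGNORECASE,
)


def looks_like_html(text, min_tags=2):
    """Heuristic: treat text as HTML if it contains at least `min_tags` known tags."""
    return len(_TAG_PATTERN.findall(text)) >= min_tags


def prepare_text(text, extract):
    """Run `extract` only on content that looks like HTML; pass plain text through.

    `extract` is any callable taking an HTML string and returning stripped text.
    """
    if looks_like_html(text):
        return extract(text)
    return text
```

A heuristic like this will misclassify some edge cases, such as plain text that quotes markup, so it is best treated as a first-pass filter rather than a guarantee.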

Updated 5 months ago
