Salience

Lex Utils currently contains two utility functions that help prepare content for use with Salience: language detection and HTML extraction. Like Salience, Lex Utils is provided as a .so (Linux) or .dll (Windows) with wrappers for C, Java, Python and .NET.

The Lex Utils objects are not thread safe, but they are small, so one can be created on each thread to support multithreaded environments.
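One way to follow the one-object-per-thread advice is to keep each object in thread-local storage. The sketch below assumes nothing about the real wrapper: `FakeLanguageSession` and `get_session` are hypothetical stand-ins, and a real factory (for example one that calls `openLanguageSession`) would be passed in instead.

```python
import threading


class FakeLanguageSession:
    """Hypothetical stand-in for a Lex Utils object, for illustration only."""

    def __init__(self, data_path):
        self.data_path = data_path


# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_session(data_path, factory=FakeLanguageSession):
    """Return the calling thread's own session, creating it on first use."""
    session = getattr(_local, "session", None)
    if session is None:
        session = factory(data_path)
        _local.session = session
    return session


def worker(results, index):
    # Record which session object this thread received.
    results[index] = get_session("/path/to/salience/data")


results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every worker ends up with a distinct object, and repeated calls from the same thread reuse the one already created there.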

Language Detection

LexLanguageUtilities provides the ability to classify text into one of the [languages supported by Salience]. Language detection only works for supported languages; content in other languages will not be correctly identified.

First, a LexLanguageUtilities object must be created:

.NET

new LexLanguageUtilities(string dataPath)

Java

new LexLanguageUtilities(string dataPath)

Python

session = openLanguageSession(string dataPath)

C

lxaOpenLexLanguageSession(char* dataPath, LexLanguageSession ** ppSession)

If you wish to perform language detection on a machine that will not have Salience installed, please contact [Lexalytics Support] to obtain a languages.bin file. Every constructor accepts the full path to that file in place of the path to the Salience data directory.
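Because the same constructor argument can be either a data directory or a languages.bin file, it can help to decide which one to pass in a single place. This is a minimal sketch: `resolve_language_data` is a hypothetical helper, not part of Lex Utils, and it only checks what exists on disk.

```python
import os


def resolve_language_data(data_dir=None, languages_bin=None):
    """Pick the path to hand to a Lex Utils language constructor.

    Prefers an explicit languages.bin file; otherwise falls back to the
    Salience data directory. Raises if neither exists on disk.
    """
    if languages_bin and os.path.isfile(languages_bin):
        return os.path.abspath(languages_bin)
    if data_dir and os.path.isdir(data_dir):
        return os.path.abspath(data_dir)
    raise FileNotFoundError("no languages.bin file or Salience data directory found")
```

The returned path would then be passed as `dataPath` to whichever wrapper you use.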

Once a session has been opened, a LanguageRecommendation object can be obtained for any text:

.NET

LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)

Java

LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)

Python

dict getLanguage(LanguageSession session, string text)

C

lxaGetLanguage(LexLanguageSession* pSession, char* acText, lxaLanguageRecommendation** pResults)

The results are split into a best match and a list of how each candidate language scored. Each language is reported as an (internal) code number, a language name, the score for that language, and the optimal score this text could have achieved. Text that scores very low compared to the optimal score may be gibberish or in an unsupported language. If the text mixes several languages, you will get multiple language results with similar scores, and the ratio of each score to the perfect score approximates the share of the text in that language.
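The score-to-perfect-score comparison described above can be sketched as follows. The function name, the 0.2 cutoff, and the sample numbers are illustrative assumptions only; the documentation does not state a recommended threshold.

```python
def interpret_scores(scores, perfect_score, min_ratio=0.2):
    """Turn raw language scores into ratios against the perfect score.

    scores: dict mapping language name -> score
    perfect_score: the optimal score for this text
    min_ratio: hypothetical cutoff below which the best match is distrusted
    """
    if perfect_score <= 0:
        return {"ratios": {}, "best": None, "reliable": False}
    ratios = {name: score / perfect_score for name, score in scores.items()}
    best = max(ratios, key=ratios.get) if ratios else None
    reliable = best is not None and ratios[best] >= min_ratio
    return {"ratios": ratios, "best": best, "reliable": reliable}


# Made-up scores: a mostly English text with some French, and a noisy text.
mixed = interpret_scores({"English": 60.0, "French": 30.0, "German": 5.0}, 100.0)
noise = interpret_scores({"English": 3.0, "French": 2.0}, 100.0)
```

In the first call the ratios suggest a text that is mostly English with a sizeable French portion; in the second, every ratio is far below the cutoff, so the best match is flagged as unreliable.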

.NET

LanguageResult

nLanguageCode: internal code representing this language
sLanguageName: name of the language
fScore: measure of how likely it is that this is the language in question
fPerfectScore: the score to compare fScore against

LanguageRecommendation

Recommendation: a single LanguageResult object representing the most likely language
vAllResults: a vector of LanguageResults for all languages considered

Java

LanguageResult

nLanguageCode: internal code representing this language
sLanguageName: name of the language
fScore: measure of how likely it is that this is the language in question
fPerfectScore: the score to compare fScore against

LanguageRecommendation

Recommendation: a LanguageResult object representing the most likely language
vAllResults: a vector of LanguageResults for all languages considered

Python

best match: language code for recommended language
bestMatchName: name of recommended language
bestMatchScore: score for recommended language
perfectScore: score to compare individual scores against
otherLanguages: list of all language results
otherLanguages\language: language code for other languages
otherLanguages\languageName: name of other languages
otherLanguages\score: measure of how well the other languages matched
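As an illustration of working with the Python result, the sketch below ranks the entries of `otherLanguages` by their score relative to `perfectScore`. It assumes the result is a plain dict with the keys listed above and that each entry of `otherLanguages` is itself a dict with `language`, `languageName` and `score` keys; the sample values, including the language codes, are invented.

```python
def rank_languages(result, top_n=3):
    """Return the top_n (name, score / perfectScore) pairs from a result dict."""
    perfect = result["perfectScore"]
    ranked = sorted(result["otherLanguages"], key=lambda r: r["score"], reverse=True)
    return [
        (r["languageName"], r["score"] / perfect if perfect else 0.0)
        for r in ranked[:top_n]
    ]


# Invented sample shaped like the fields described above.
sample = {
    "bestMatchName": "English",
    "bestMatchScore": 80.0,
    "perfectScore": 100.0,
    "otherLanguages": [
        {"language": 2, "languageName": "French", "score": 10.0},
        {"language": 1, "languageName": "English", "score": 80.0},
        {"language": 3, "languageName": "German", "score": 5.0},
    ],
}
top = rank_languages(sample, top_n=2)
```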

C

lxaLanguageRecommendation

nRecommended: language code for recommended language
fRecommendedScore: score for the recommended language
nChoices: number of languages considered, also the length of the arrays
pPossibleLanguages: array of language codes for each language considered
pLanguageScores: array of language scores for each language, corresponding to indices in pPossibleLanguages
fWordCount: total words and bigrams considered, for use in interpreting scores

C provides an additional function that transforms a language code into its name:

lxaGetLanguageName(int nLanguageCode, char** acOutText);

HTML Extraction

LexHtmlUtilities removes HTML tags from a document and attempts to strip out unrelated content like ads and sidebars. Stripping out unrelated content will sometimes remove some of the article text, but not significant portions. This is particularly noticeable if you provide non-HTML content: in that case you'll get the text back with some sentences removed for being 'off topic', so separating HTML from non-HTML content before using the HTML extractor is recommended.
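A rough way to separate HTML from non-HTML content before extraction is to count recognizable tags. This is a heuristic sketch, not part of Lex Utils: the tag list, the `min_tags` threshold, and both function names are assumptions you would tune for your own content.

```python
import re

# Matches opening or closing forms of a few common HTML tags.
_TAG_RE = re.compile(
    r"<\s*/?\s*(html|head|body|div|p|span|a|br|table|script|style)\b[^>]*>",
    re.IGNORECASE,
)


def looks_like_html(text, min_tags=2):
    """Heuristically decide whether text is HTML by counting known tags."""
    return len(_TAG_RE.findall(text)) >= min_tags


def split_by_kind(documents):
    """Partition documents into (html, plain) lists using the heuristic."""
    html, plain = [], []
    for doc in documents:
        (html if looks_like_html(doc) else plain).append(doc)
    return html, plain
```

Only the documents in the first list would then be sent to the HTML extractor; the rest can go to Salience unchanged.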

First, a LexHtmlUtilities object must be created:

.NET

new LexHtmlUtilities(string dataPath)

Java

new LexHtmlUtilities(string dataPath)

Python

session = openHtmlSession(string dataPath)

C

lxaOpenLexHtmlSession(char* dataPath, LexHtmlSession ** ppSession)

Then simply pass in content to the extraction function to get stripped text back:

.NET

string LexHtmlUtilities.StripHtml(string htmlText)

Java

string LexHtmlUtilities.ExtractTextFromHtml(string htmlText)

Python

string extractText(HTMLSession session, string htmlText)

C

lxaExtractTextFromHtml(LexHtmlSession* pSession, char* acInText, char** acOutText)
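The two utilities can be chained: strip the HTML first, then run language detection on what remains. The sketch below takes both steps as callables so it makes no assumption about how the Python wrapper is imported; `detect_language_of_page` is a hypothetical helper, and in practice the callables would be thin wrappers around `extractText(session, ...)` and `getLanguage(session, ...)`.

```python
def detect_language_of_page(html_text, extract_text, get_language):
    """Strip HTML first, then run language detection on the remaining text.

    extract_text: callable taking HTML and returning plain text
    get_language: callable taking plain text and returning a language result
    """
    text = extract_text(html_text)
    if not text.strip():
        return None  # nothing left to classify after extraction
    return get_language(text)
```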
