Lex Utils currently contains two utility functions that help prepare content for use with Salience: language detection and HTML extraction. Like Salience, Lex Utils is provided as a .so (Linux) or .dll (Windows), with wrappers in C, Java, Python and .NET.
The Lex Utils objects are not thread safe, but they are small, so one can be created on each thread to support multithreaded environments.
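The one-object-per-thread pattern described above can be sketched with Python's `threading.local`. This is only an illustration: `PlaceholderDetector` is a hypothetical stand-in, not the real Lex Utils wrapper class, and `get_detector` and `run_workers` are names invented for this example.

```python
import threading


class PlaceholderDetector:
    """Hypothetical stand-in for a Lex Utils object (NOT the real wrapper class)."""

    def __init__(self, data_path):
        self.data_path = data_path


# Each thread sees its own independent attributes on this object.
_thread_state = threading.local()


def get_detector(data_path):
    """Return the calling thread's own detector, creating it on first use."""
    detector = getattr(_thread_state, "detector", None)
    if detector is None:
        detector = PlaceholderDetector(data_path)
        _thread_state.detector = detector
    return detector


def run_workers(num_threads, data_path="/path/to/data"):
    """Start num_threads workers and return the detector each one used."""
    used = []
    lock = threading.Lock()

    def worker():
        detector = get_detector(data_path)
        # Repeated calls on the same thread reuse the same object.
        assert detector is get_detector(data_path)
        with lock:
            used.append(detector)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return used
```

Because the objects are small, creating one lazily per thread is usually cheaper and simpler than guarding a single shared object with a lock.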
Language Detection
LexLanguageUtilities classifies text into one of the [languages supported by Salience]. Language detection only works for supported languages; content in other languages will not be correctly identified.
First, a LexLanguageUtilities object must be created:
| .NET | |
|---|---|
| Java | |
| Python | |
| C | |
If you wish to perform language detection on a machine that will not have Salience installed, please contact [Lexalytics Support] to obtain a languages.bin file. Every constructor accepts the full path to languages.bin in place of the path to the Salience data directory.
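As a rough sketch of the choice described above, the helper below picks the single path string to hand to a constructor: the full path to languages.bin when one is supplied, otherwise the data directory. The function name and its parameters are hypothetical and are not part of Lex Utils; the helper only validates and returns a path.

```python
import os


def resolve_language_data_path(data_dir=None, languages_bin=None):
    """Return the path to pass to a constructor (hypothetical helper).

    Prefers languages_bin when given; otherwise falls back to data_dir.
    """
    if languages_bin is not None:
        full_path = os.path.abspath(languages_bin)
        if not os.path.isfile(full_path):
            raise FileNotFoundError(f"languages.bin not found: {full_path}")
        return full_path
    if data_dir is None or not os.path.isdir(data_dir):
        raise NotADirectoryError(f"data directory not found: {data_dir}")
    return os.path.abspath(data_dir)
```

Checking the path up front gives a clear error before any native library call is made.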
Once a session has been opened, a LanguageRecommendation object can be obtained for any text:
| .NET | |
|---|---|
| Java | |
| Python | |
| C | |
The results are split into a best match and a list of how each possible language scored. Each language is provided as an (internal) code number, a language string, the score for that language, and the optimal score this text could have achieved. Text that scores very low compared to the optimal score may be gibberish or in an unsupported language. If the text contains multiple languages, you will get multiple language results with similar scores, and the ratio of each score to the optimal score approximates the proportion of that language in the text.
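The score interpretation described above can be sketched as follows. The `LanguageScore` class and its field names are hypothetical stand-ins that mirror the four values listed (code, name, score, optimal score), not the real wrapper types. The `min_ratio` and `mixed_margin` thresholds are arbitrary assumptions to tune on your own data, and the language codes used in any example data are made up.

```python
from dataclasses import dataclass


@dataclass
class LanguageScore:
    # Hypothetical field names; the real wrapper types may differ.
    code: int
    name: str
    score: float
    optimal_score: float


def score_ratio(result):
    """Fraction of the optimal score this language achieved."""
    if result.optimal_score <= 0:
        return 0.0
    return result.score / result.optimal_score


def interpret(results, min_ratio=0.2, mixed_margin=0.1):
    """Classify a list of scores as 'unknown', 'single', or 'mixed'.

    Returns (label, names): names lists the languages within
    mixed_margin of the best ratio, best first.
    """
    if not results:
        return ("unknown", [])
    ranked = sorted(results, key=score_ratio, reverse=True)
    best = score_ratio(ranked[0])
    if best < min_ratio:
        # Far below optimal: possibly gibberish or an unsupported language.
        return ("unknown", [])
    close = [r.name for r in ranked if best - score_ratio(r) <= mixed_margin]
    return ("mixed" if len(close) > 1 else "single", close)
```

Comparing ratios rather than raw scores keeps the thresholds independent of text length, since the optimal score is reported per text.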
.NET
LanguageResult
LanguageRecommendation
Java
LanguageResult
LanguageRecommendation
Python
C
lxaLanguageRecommendation
C provides an additional function to transform a language code into its name:
lxaGetLanguageName(int nLanguageCode, char** acOutText);
HTML Extraction
LexHtmlUtilities removes HTML tags from a document and attempts to strip out unrelated content such as ads and sidebars. Stripping out unrelated content will sometimes remove some of the article text, but not significant portions. This is particularly noticeable if you provide non-HTML content: in that case you will get the text back with some sentences removed for being 'off topic'. Separating HTML from non-HTML content before using the HTML extractor is therefore recommended.
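One way to separate HTML from non-HTML content beforehand is a simple markup check, sketched below. This is a heuristic under stated assumptions, not part of Lex Utils: the function names, the tag list, and the `min_tags` threshold are all choices made for this illustration.

```python
import re

# Common structural tags; an illustrative list, not an exhaustive one.
_TAG_RE = re.compile(
    r"<\s*/?\s*(html|head|body|div|p|span|a|table|br|h[1-6]|ul|ol|li|"
    r"script|style|article|section)\b[^>]*>",
    re.IGNORECASE,
)
_DOCTYPE_RE = re.compile(r"<!doctype\s+html", re.IGNORECASE)


def looks_like_html(text, min_tags=2):
    """Heuristically decide whether text is HTML markup."""
    if _DOCTYPE_RE.search(text):
        return True
    tag_count = sum(1 for _ in _TAG_RE.finditer(text))
    return tag_count >= min_tags


def split_by_markup(documents):
    """Partition documents into (html_docs, plain_docs)."""
    html_docs, plain_docs = [], []
    for doc in documents:
        (html_docs if looks_like_html(doc) else plain_docs).append(doc)
    return html_docs, plain_docs
```

Only the documents in the first list would then be sent to the HTML extractor, while plain text goes straight to later processing so that no sentences are dropped as 'off topic'.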
First, a LexHtmlUtilities object must be created:
| .NET | |
|---|---|
| Java | |
| Python | |
| C | |
Then simply pass in content to the extraction function to get stripped text back:
| .NET | |
|---|---|
| Java | |
| Python | |
| C | |