Text Preparation

prepareText

Summary

Prepares a piece of text for processing. This function, or its sister function prepareTextFromFile, must be called every time you want to process a different piece of text. Text should be either 7-bit ASCII or UTF-8.

This method provides a wrapper around the underlying C API method lxaPrepareText.

Syntax

salience6.prepareText(oSession, sText)

Parameters

oSession

A SalienceSession object previously created via openSession

sText

A text string to be analyzed

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.
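The examples on this page treat 0 as success and 6 as a cue to read warnings through getLastWarnings. As a minimal sketch, a small helper can make that branching explicit. The names RC_OK, RC_WARNING, and classify_return_code are hypothetical, and treating every other code as an error is an assumption; the Errors and Warning Codes page remains the authoritative list.

```python
# Hypothetical constants: the values 0 and 6 are taken from the examples on
# this page, not from the Errors and Warning Codes page.
RC_OK = 0
RC_WARNING = 6


def classify_return_code(ret):
    """Map a return code to 'ok', 'warning', or 'error'.

    Assumption: every code other than 0 and 6 is treated as an error.
    """
    if ret == RC_OK:
        return "ok"
    if ret == RC_WARNING:
        return "warning"
    return "error"
```

A caller could branch on the result and call getLastWarnings only in the 'warning' case.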

Notes

Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.

Sentences that exceed 1000 words will cause the underlying call to lxaPrepareText to return with LXA_ERROR.
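The two limits above can be checked approximately before text is handed to prepareText. The sketch below is an illustration only: check_text_limits is a hypothetical helper, and it assumes words are separated by whitespace and sentences end at '.', '!' or '?'. The library's own tokenizer may split text differently, so treat the result as an early warning rather than a guarantee.

```python
import re

MAX_WORD_CHARS = 366       # longer words are truncated
MAX_SENTENCE_WORDS = 1000  # longer sentences lead to LXA_ERROR


def check_text_limits(text):
    """Return (long_words, long_sentence_count) for a piece of text.

    Assumption: naive whitespace and punctuation splitting, which only
    approximates how the library tokenizes text.
    """
    # Words longer than the documented truncation limit
    long_words = [w for w in text.split() if len(w) > MAX_WORD_CHARS]
    # Rough sentence split on terminal punctuation
    sentences = re.split(r"[.!?]+", text)
    long_sentences = sum(
        1 for s in sentences if len(s.split()) > MAX_SENTENCE_WORDS
    )
    return long_words, long_sentences
```

If either result is non-empty or non-zero, the text could be split or cleaned before preparation.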

Example

import salience6 as se6

session = se6.openSession('/path/to/license.v5', '/path/to/data')
ret = se6.prepareText(session, 'Lexalytics is based in Amherst, MA.')
if ret == 0:
    pass  # ...extract results from text...
elif ret == 6:
    # Code 6: fetch the warnings recorded for this session
    print(se6.getLastWarnings(session))
se6.closeSession(session)

prepareTextFromFile

Summary

Prepares the text contents of a file for processing. This function, or its sister function prepareText, must be called every time you want to process a different piece of text. The text file should be either 7-bit ASCII or UTF-8.

This method provides a wrapper around the underlying C API method lxaPrepareTextFromFile.

Syntax

salience6.prepareTextFromFile(oSession, sFile)

Parameters

oSession

A SalienceSession object previously created via openSession

sFile

Fully-qualified path to a readable text file

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.

Notes

Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.

Sentences that exceed 1000 words will cause the underlying call to lxaPrepareTextFromFile to return with LXA_ERROR.

Example

import salience6 as se6

session = se6.openSession('/path/to/license.v5', '/path/to/data')
ret = se6.prepareTextFromFile(session, '/path/to/aFile.txt')
if ret == 0:
    pass  # ...extract results from text...
elif ret == 6:
    # Code 6: fetch the warnings recorded for this session
    print(se6.getLastWarnings(session))
se6.closeSession(session)

addSection

Summary

Adds the supplied text into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.

This method provides a wrapper around the underlying C API method lxaAddSection.

Syntax

salience6.addSection(oSession, sHeader, sText, nProcess)

Parameters

oSession

A SalienceSession object previously created via openSession

sHeader

A text string specifying the header for the section

sText

A text string for the section

nProcess

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.
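As a rough sketch of how several sections might be added in sequence, the helper below loops over (header, text) pairs and calls addSection with the signature shown above. The function name add_sections is hypothetical. The imported salience6 module is passed in as se6, and process_flag is forwarded unchanged as nProcess, since this page does not describe what that parameter means. Stopping at the first non-zero return code is an assumption about what a caller would want.

```python
def add_sections(se6, session, sections, process_flag):
    """Add (header, text) pairs in order; stop at the first non-zero code.

    se6 is the imported salience6 module. process_flag is passed through
    as nProcess without interpretation. Returns the last return code,
    which is 0 if every call succeeded (or if sections is empty).
    """
    ret = 0
    for header, text in sections:
        ret = se6.addSection(session, header, text, process_flag)
        if ret != 0:
            break
    return ret
```

A warning code would also stop this loop; a caller that prefers to continue past warnings could compare against specific codes instead.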

addSectionFromFile

Summary

Adds the text from the supplied file into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.

This method provides a wrapper around the underlying C API method lxaAddSectionFromFile.
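Because the file contents must be UTF-8 rather than UTF-16, it can help to check a file before passing its path to the library. The sketch below uses only the Python standard library; looks_like_utf8 is a hypothetical helper name. It rejects files that start with a UTF-16 byte order mark or that fail to decode as UTF-8. UTF-16 files written without a byte order mark can still slip through, so this is a heuristic rather than a full validation.

```python
import codecs


def looks_like_utf8(path):
    """Return True if the file decodes as UTF-8 and has no UTF-16 BOM."""
    with open(path, "rb") as handle:
        data = handle.read()
    # A UTF-16 byte order mark is a strong sign of the wrong encoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Plain 7-bit ASCII files pass this check as well, since ASCII is a subset of UTF-8.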

Syntax

salience6.addSectionFromFile(oSession, sHeader, sPath, nProcess)

Parameters

oSession

A SalienceSession object previously created via openSession

sHeader

A text string specifying the header for the section

sPath

Path to a valid text file containing text for the section

nProcess

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.

correctOCRErrors

Summary

Corrects likely OCR errors found in prepared text. This function is useful when processing scanned documents that have been converted to electronic format through optical character recognition; there is no need to call it if the document originated in an electronic format (word processing, web page, tweet, etc.). If used, it should be called after the text preparation functions (prepareText and prepareTextFromFile), but before any text analytics functions (that is, any of the other functions that are typically called after preparing text).

This method provides a wrapper around the underlying C API method lxaCorrectOCRErrors.
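The call order described in the summary (prepare first, correct second, analyze last) can be sketched as a small helper. The name prepare_and_correct is hypothetical, and the imported salience6 module is passed in as se6. Skipping correction whenever prepareText returns a non-zero code is an assumption, not documented behavior.

```python
def prepare_and_correct(se6, session, text, attributes=None, threshold=1.0):
    """Prepare text, then run OCR correction before any analysis calls.

    se6 is the imported salience6 module. Returns the corrected string,
    or None if preparation returned a non-zero code or correction failed.
    """
    ret = se6.prepareText(session, text)
    if ret != 0:
        # Assumption: do not attempt correction after a failed preparation
        return None
    # An empty list asks the library to consider every word
    return se6.correctOCRErrors(session, attributes or [], threshold)
```

Analysis functions would then be called on the same session after this helper returns.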

Syntax

salience6.correctOCRErrors(oSession, vAttributes, fConfidenceThreshold)

Parameters

oSession

A SalienceSession object previously created via openSession

vAttributes

A list of dictionaries, each containing OCR character attributes (see immediately below) for a suspect character in the text. Only those words containing the characters identified in the attribute lists will be subject to correction. If the list is empty, all words in the document text are treated as candidates for correction.

fConfidenceThreshold

A float giving the confidence below which a character in vAttributes will be subject to OCR correction. It is used only when vAttributes is not empty.

Each element of the vAttributes parameter is a dictionary with the following entries:

char_offset

An integer giving the character offset of a suspect OCR'ed character in the original document text

confidence

Optional: A float giving the confidence of the correctness of the suspect OCR'ed character. Ranges from 0 to 1, with 1 being most confident. If not used, you should either omit it from the dictionary or set it to -1.

height

Optional: A float giving the height of the suspect character bounding box, in typographical points. If not used, you should either omit it from the dictionary or set it to -1.

width

Optional: A float giving the width of the suspect character bounding box, in typographical points. If not used, you should either omit it from the dictionary or set it to -1.

x_position

Optional: A float giving the distance of the left side of the suspect character bounding box from the left side of the page, in typographical points. If not used, you should either omit it from the dictionary or set it to -1.

y_position

Optional: A float giving the distance of the top of the suspect character bounding box from the top of the page, in typographical points. If not used, you should either omit it from the dictionary or set it to -1.

page

Optional: An integer giving the page number of the suspect character, counting from page number 0. If not used, you should either omit it from the dictionary or set it to -1.
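The dictionary entries listed above can be assembled with a small helper that omits unused optional values, which is one of the two options the descriptions allow (the other being -1). This is a sketch only: make_attribute and OPTIONAL_KEYS are hypothetical names, and the helper simply builds a plain Python dictionary using the key names from this page.

```python
OPTIONAL_KEYS = (
    "confidence", "height", "width", "x_position", "y_position", "page",
)


def make_attribute(char_offset, **optional):
    """Build one vAttributes entry, omitting optional values that are unset.

    Values of None or -1 are left out of the dictionary entirely.
    """
    unknown = set(optional) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError(
            "unknown attribute(s): %s" % ", ".join(sorted(unknown))
        )
    entry = {"char_offset": int(char_offset)}
    for key in OPTIONAL_KEYS:
        value = optional.get(key)
        if value is not None and value != -1:
            entry[key] = value
    return entry
```

For example, make_attribute(14, confidence=0.0) yields a dictionary with only char_offset and confidence, ready to be placed in the list passed as vAttributes.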

Returns

Returns the corrected text string if the OCR error correction is successful and a None value otherwise.

Notes

When vAttributes is an empty list, this function checks every character in the document to determine whether it needs to be corrected. However, some OCR engines produce confidence information for characters that they believe might have been misrecognized. If this information is available, you can pass it in the vAttributes argument along with a confidence cutoff in fConfidenceThreshold. In this case, only those characters identified in vAttributes whose confidence values fall below fConfidenceThreshold will be corrected, which speeds up the OCR error correction process.
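To preview which characters would qualify under the rule described above, the selection can be mimicked in plain Python before calling the library. This sketch illustrates the stated rule and is not the library's implementation; offsets_below_threshold is a hypothetical name. It assumes entries without a confidence value (missing or -1) are left out, because this page does not say how the library treats them.

```python
def offsets_below_threshold(attributes, threshold):
    """Return char_offset values whose confidence is below threshold.

    Assumption: entries with no confidence (missing or -1) are skipped.
    """
    selected = []
    for entry in attributes:
        confidence = entry.get("confidence", -1)
        if confidence != -1 and confidence < threshold:
            selected.append(entry["char_offset"])
    return selected
```

With a threshold of 0.5, an entry at offset 14 with confidence 0.0 is selected, while an entry with confidence 0.9 is not.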

Example

import salience6 as se6

session = se6.openSession('/path/to/license.v5', '/path/to/data')

# This will correct both "hased" and "Hmherst"
se6.prepareText(session, 'Lexalytics is hased in Hmherst, MA.')
corrected = se6.correctOCRErrors(session, [], 1.0)

# This will correct only "hased" (offset 14 is the "h" in "hased")
se6.prepareText(session, 'Lexalytics is hased in Hmherst, MA.')
attributes = [{"char_offset": 14, "confidence": 0.0}]
corrected = se6.correctOCRErrors(session, attributes, 0.5)

se6.closeSession(session)

prepareCollectionFromList

Summary

Prepares the contents of a list for processing. This function, or its sister function prepareCollectionFromFile, must be called every time you want to process a different set of related pieces of text. The list must consist of individual strings that are either 7-bit ASCII or UTF-8.

This method provides a wrapper around the underlying C API method lxaPrepareCollection.

Syntax

salience6.prepareCollectionFromList(oSession, sName, lstContent)

Parameters

oSession

A SalienceSession object previously created via openSession

sName

A descriptive name for the collection

lstContent

A list of text strings to process as a collection of related content

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.

Example

import salience6 as se6

session = se6.openSession('/path/to/license.v5', '/path/to/data')
# Prepare the collection
content = []
content.append("The cruise was excellent and service was great.")
content.append("I found the temperature in the dining rooms very cold.")
# ...
ret = se6.prepareCollectionFromList(session, 'myCollection', content)
if ret == 0:
    pass  # ...extract results from text...
elif ret == 6:
    # Code 6: fetch the warnings recorded for this session
    print(se6.getLastWarnings(session))
se6.closeSession(session)

prepareCollectionFromFile

Summary

Prepares the contents of a file for collection processing. This function, or its sister function prepareCollectionFromList, must be called every time you want to process a different set of related pieces of text. The file must consist of individual strings that are either 7-bit ASCII or UTF-8.

This method provides a wrapper around the underlying C API method lxaPrepareCollectionFromFile.

Syntax

salience6.prepareCollectionFromFile(oSession, sName, sPath)

Parameters

oSession

A SalienceSession object previously created via openSession

sName

A descriptive name for the collection

sPath

A text file containing a list of text strings to process as a collection of related content

Returns

Returns an integer return code. Possible error return codes are specified on the Errors and Warning Codes page.

Example

import salience6 as se6

session = se6.openSession('/path/to/license.v5', '/path/to/data')
ret = se6.prepareCollectionFromFile(session, 'myCollection', '/path/to/aFile.txt')
if ret == 0:
    pass  # ...extract results from text...
elif ret == 6:
    # Code 6: fetch the warnings recorded for this session
    print(se6.getLastWarnings(session))
se6.closeSession(session)

Updated 4 months ago
