Text Preparation

API reference for functions that prepare, modify, and OCR-correct individual pieces of text and collections of text.

lxaPrepareText

Summary: Prepares a piece of text for processing. This function, or its sister function lxaPrepareTextFromFile, must be called every time you want to process a different piece of text. Text should either be 7-bit ASCII or UTF-8.

Returns: Integer return code. LXA_OK when text preparation is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. A sentence that exceeds 1000 words causes the function to return LXA_ERROR.

To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Notes: Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word that is not a chemical compound.
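Because overlong words are truncated rather than rejected, it can help to detect them before preparing the text. The following is a minimal sketch, not part of the Salience API: longest_word_bytes and has_overlong_word are hypothetical helper names, MAX_WORD_CHARS simply restates the limit from the note above, and words are assumed to be whitespace-delimited. Lengths are counted in bytes, so multi-byte UTF-8 characters are overcounted and the check is conservative.

```c
#include <ctype.h>
#include <stddef.h>

/* Limit restated from the note above; not a constant from the library. */
#define MAX_WORD_CHARS 366

/* Returns the length in bytes of the longest whitespace-delimited run in
   acText. Hypothetical helper; the library's tokenizer may split words
   differently. */
size_t longest_word_bytes(const char *acText) {
    size_t longest = 0, current = 0;
    for (const unsigned char *p = (const unsigned char *)acText; *p; ++p) {
        if (isspace(*p)) {
            if (current > longest) longest = current;
            current = 0;
        } else {
            ++current;
        }
    }
    return current > longest ? current : longest;
}

/* Returns 1 if some word is longer than the documented limit, else 0. */
int has_overlong_word(const char *acText) {
    return longest_word_bytes(acText) > MAX_WORD_CHARS;
}
```

A caller could run has_overlong_word on a buffer and log a notice before passing the same buffer to lxaPrepareText.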

Function Signature:
int lxaPrepareText(SalienceSession *pSession, const char *acText);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acText

Pointer to a char buffer containing the text to be prepared

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *acBuffer = "This is some text to process";

  int returnCode = lxaPrepareText(pSession, acBuffer);
  if(returnCode == LXA_ERROR) {
    return 1;
  } else if(returnCode == LXA_OK_WITH_WARNINGS) {
    return 2;
  }

  // ...

lxaPrepareTextFromFile

Summary: Prepares the text contents of a file for processing. This function, or its sister function lxaPrepareText, must be called every time you want to process a different piece of text. Text should be either 7-bit ASCII or UTF-8.

Returns: Integer return code. LXA_OK when text preparation is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. A sentence that exceeds 1000 words causes the function to return LXA_ERROR.

To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Notes: Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.
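Since a sentence of more than 1000 words causes LXA_ERROR, a rough pre-check on text already read into memory may save a failed call. The sketch below is an assumption-laden approximation, not the library's own logic: max_words_per_sentence is a hypothetical helper, words are taken to be whitespace-delimited, and a sentence is taken to end at '.', '!' or '?'. Salience's real sentence and word boundaries will differ (for example around abbreviations and decimals), so treat the result as a hint only.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the largest number of whitespace-delimited words found in any
   run of text that ends in '.', '!' or '?' (or at the end of the input).
   Hypothetical helper using a deliberately naive sentence rule. */
size_t max_words_per_sentence(const char *acText) {
    size_t maxWords = 0, words = 0;
    int inWord = 0;
    for (const unsigned char *p = (const unsigned char *)acText; *p; ++p) {
        if (isspace(*p)) {
            inWord = 0;
        } else {
            if (!inWord) { ++words; inWord = 1; }
            if (*p == '.' || *p == '!' || *p == '?') {
                /* Sentence ends here: record its word count and reset. */
                if (words > maxWords) maxWords = words;
                words = 0;
                inWord = 0;
            }
        }
    }
    return words > maxWords ? words : maxWords;
}
```

A result above 1000 suggests the text may need to be split or cleaned before it is prepared.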

Function Signature:
int lxaPrepareTextFromFile(SalienceSession *pSession, const char *acFile);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acFile

Pointer to a null-terminated string containing the path to the file you want to process

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *pFile = "path/to/file/goes/here";

  int returnCode = lxaPrepareTextFromFile(pSession, pFile);
  if(returnCode == LXA_ERROR) {
    return 1;
  } else if(returnCode == LXA_OK_WITH_WARNINGS) {
    return 2;
  }

  // ...

lxaAddSection

Summary: Adds the supplied text into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.
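The encoding requirement can be checked before a buffer is handed to the library. The following is a minimal sketch under stated assumptions: starts_with_utf16_bom and is_valid_utf8 are hypothetical helpers, not Salience functions, and they take an explicit length because UTF-16 data usually contains zero bytes that would cut a C string short. A byte order mark is only a heuristic; UTF-16 without one is caught by the UTF-8 check only if its bytes happen to be malformed as UTF-8.

```c
#include <stddef.h>

/* Returns 1 if the first two bytes are a UTF-16 byte order mark. */
int starts_with_utf16_bom(const unsigned char *buf, size_t len) {
    return len >= 2 &&
           ((buf[0] == 0xFF && buf[1] == 0xFE) ||
            (buf[0] == 0xFE && buf[1] == 0xFF));
}

/* Returns 1 if buf[0..len) is well-formed UTF-8 (7-bit ASCII included). */
int is_valid_utf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;
        unsigned long cp;
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return 0;                      /* invalid lead byte */
        if (extra > len - i - 1) return 0;  /* truncated sequence */
        for (size_t k = 1; k <= extra; ++k) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF) return 0;
        i += extra + 1;
    }
    return 1;
}
```

Text that fails either check would need to be transcoded to UTF-8 before being added as a section.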

Returns: Integer return code.

To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Function Signature:
int lxaAddSection(SalienceSession *pSession, const char *acHeader, const char *acText, int nProcess);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acHeader

Pointer to a char buffer containing the name of the section the text should be added to

acText

Pointer to a char buffer containing the text to be added to the section

nProcess

Integer value that determines whether or not the section is processed. Set this value to 1 to mark the section for processing, or to 0 to skip it. Skipping is useful when a section holds metadata rather than text to be analyzed.

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *headerName = "header-name-of-section";
  const char *textData = "People who insist on picking their teeth with their elbows are so annoying!";

  // set nProcess argument to 1 to process text
  if(lxaAddSection(pSession, headerName, textData, 1) != LXA_OK) {
    return 1;
  }

  // ...

lxaAddSectionFromFile

Summary: Adds the text from the supplied file into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16. This is similar to lxaAddSection, but takes its input from a file.

Returns: Integer return code. To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Function Signature:
int lxaAddSectionFromFile(SalienceSession *pSession, const char *acHeader, const char *acFile, int nProcess);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acHeader

Pointer to a char buffer containing the name of the section the text should be entered into

acFile

Pointer to a char buffer containing the path to a text file to be added to the section

nProcess

Integer value that determines whether or not the section is processed. Set this value to 1 to mark the section for processing, or to 0 to skip it. Skipping is useful when a section holds metadata rather than text to be analyzed.

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *headerName = "header-name-of-section";
  const char *fileName = "path/to/file";

  // set nProcess argument to 1 to process text
  if(lxaAddSectionFromFile(pSession, headerName, fileName, 1) != LXA_OK) {
    return 1;
  }

  // ...

lxaCorrectOCRErrors

Summary: Corrects likely OCR errors found in prepared text. This function is useful when processing scanned documents that have been converted to electronic format through optical character recognition; there is no need to call it if the document originated in an electronic format (word processing, web page, tweet, etc.). If used, it should be called after a text preparation function (lxaPrepareText or lxaPrepareTextFromFile) but before any text analytics function (that is, any of the other functions that are typically called after preparing text).

Returns: Integer return code. LXA_OK if OCR correction was successful. LXA_API_ABUSE if the function is called either before the text has been prepared for analysis using lxaPrepareText or after a text analytics function has been performed.

To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Function Signature:
int lxaCorrectOCRErrors(SalienceSession *pSession, OCRCharacterAttributeList *pOCRCharacterAttributes, float fConfidenceThreshold, char **acCorrectedText);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

pOCRCharacterAttributes

Pointer to an OCRCharacterAttributeList structure, which identifies suspect characters in the text. Only words containing these characters are subject to correction. If this list is null or empty, all characters in the document text are treated as candidates for correction

fConfidenceThreshold

Confidence below which a character in pOCRCharacterAttributes will be subject to OCR correction. It is used only when pOCRCharacterAttributes is not null and not empty

acCorrectedText

Address of a char pointer that is set to point to a buffer containing the corrected text

Notes: When the argument pOCRCharacterAttributes is null or an empty list, this function checks every character in the document to determine whether it needs to be corrected. However, some OCR engines produce confidence information for characters that they believe might have been misrecognized. If this information is available, you can pass it in the pOCRCharacterAttributes argument along with a confidence cutoff in fConfidenceThreshold. In this case, only those characters identified in pOCRCharacterAttributes whose confidence values fall below fConfidenceThreshold will be corrected, which speeds up the OCR error correction process.
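The threshold rule described in the note can be pictured with a small standalone sketch. Everything in it is illustrative: CharConfidence and select_suspects are hypothetical names, the field layout is not taken from the Salience headers, and a confidence range of 0.0 to 1.0 is an assumption. It only shows the selection step, that is, which characters would remain candidates for correction at a given cutoff.

```c
#include <stddef.h>

/* Hypothetical stand-in for one OCR confidence record; the field names are
   illustrative and are not taken from the Salience headers. */
typedef struct {
    int   offset;      /* character offset in the document text */
    float confidence;  /* recognizer confidence, assumed 0.0 (low) to 1.0 (high) */
} CharConfidence;

/* Copies into out the offsets of records whose confidence is strictly below
   threshold, preserving order. out must have room for n entries.
   Returns the number of offsets written. */
size_t select_suspects(const CharConfidence *items, size_t n,
                       float threshold, int *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (items[i].confidence < threshold) {
            out[count++] = items[i].offset;
        }
    }
    return count;
}
```

With records at offsets 8, 16 and 20 carrying confidences 0.30, 0.95 and 0.50, a threshold of 0.60 leaves offsets 8 and 20 as candidates, so a higher threshold widens the set of characters examined and a lower one narrows it.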

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *acBuffer = "This is zome tex1 to process";

  // This will correct both "zome" and "tex1"
  lxaPrepareText(pSession, acBuffer);
  char *acCorrected;
  lxaCorrectOCRErrors(pSession, nullptr, 0.0f, &acCorrected);

  // This will correct only "zome": offset 8 is its 'z', and the
  // attribute's confidence must fall below the 1.0f threshold
  lxaPrepareText(pSession, acBuffer);
  OCRCharacterAttribute oAttr;
  oAttr.nCharOffset = 8;
  OCRCharacterAttributeList oAttrList;
  oAttrList.nAttributes = 1;
  oAttrList.pAttributes = &oAttr;
  lxaCorrectOCRErrors(pSession, &oAttrList, 1.0f, &acCorrected);

  // ...

lxaPrepareCollection

Summary: Prepares the contents of a SalienceCollection structure for processing. This function, or its sister function lxaPrepareCollectionFromFile, must be called every time you want to process a different set of related pieces of text. Text within individual members of the collection should be either 7-bit ASCII or UTF-8.

Returns: Integer return code. LXA_OK when text preparation for the collection is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. A sentence that exceeds 1000 words causes the function to return LXA_ERROR.

To retrieve more information on warnings and errors use lxaGetLastWarnings on the SalienceSession structure.

Function Signature:
int lxaPrepareCollection(SalienceSession *pSession, SalienceCollection *pCollection);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession.

pCollection

Pointer to a SalienceCollection structure containing the related pieces of text you want to process.

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense, and that acText1 and acText2 are char*
  buffers holding the documents (requires <vector> and <cstdlib>) */

  std::vector<char*> myVector;
  myVector.push_back(acText1);
  myVector.push_back(acText2);

  int nSize = (int)myVector.size();

  SalienceCollection oCollection;
  oCollection.acCollectionName = "MyCollection";
  oCollection.nSize = nSize;
  oCollection.pDocuments =
    (SalienceCollectionDocument*)malloc(sizeof(SalienceCollectionDocument) * nSize);

  for(int i=0; i < nSize; i++) {
    oCollection.pDocuments[i].acText = myVector[i];
    oCollection.pDocuments[i].acIndentifier = "myDoc";
    oCollection.pDocuments[i].nIsText = 1;
    oCollection.pDocuments[i].nSplitByLine = 0;
  }

  // pass the address of the collection structure built above
  if (lxaPrepareCollection(pSession, &oCollection) != LXA_OK) {
    return 1;
  }

  // ...

lxaPrepareCollectionFromFile

Summary: Prepares the contents of a file as a collection for processing. This function, or its sister function lxaPrepareCollection, must be called every time you want to process a different set of related pieces of text. The contents of the file are expected to be 7-bit ASCII or UTF-8.

Returns: Integer return code. LXA_OK when text preparation for the collection is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. A sentence that exceeds 1000 words causes the function to return LXA_ERROR.

To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.

More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.

Function Signature:
int lxaPrepareCollectionFromFile(SalienceSession *pSession, const char *acCollectionName, const char *acFile);

Parameter

Description

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acCollectionName

Pointer to a null-terminated string containing a descriptive name for the collection

acFile

Pointer to a null-terminated string containing the path to the file you want to process

Example:
Use lxaLoadLicense to create a LexalyticsLicense structure. Pass that structure to lxaOpenSalienceSession to start a session with Salience.

/* Assuming a SalienceSession (*pSession) has already been opened with
  a valid LexalyticsLicense */

  const char *acPath = "/path/to/content";
  const char *acCollectionName = "MyCollection";

  if(lxaPrepareCollectionFromFile(pSession, acCollectionName, acPath) != LXA_OK) {
    return 1;
  }

  // ...