Text Preparation

lxaPrepareText

Summary

Prepares a piece of text for processing. This function, or its sister function lxaPrepareTextFromFile, must be called every time you want to process a different piece of text. The text must be either 7-bit ASCII or UTF-8.

Syntax

int lxaPrepareText(SalienceSession *pSession, const char *acText);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acText

Pointer to a null-terminated string containing the text you want to process

Returns

This function returns an integer return code.

LXA_OK

Text preparation was successful

LXA_OK_WITH_WARNINGS

The text is malformed in some way. You should call [lxaGetLastWarnings][6] to find out the specifics of the warning.
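
The return codes above suggest a three-way decision after each preparation call: carry on, carry on but read the warnings, or stop. The following is a minimal sketch of that decision, not part of the Salience API. The PrepareAction enum and the classifyPrepareResult helper are hypothetical names, and the codes are passed in as parameters so that the block compiles without the Salience headers; in real code you would pass LXA_OK and LXA_OK_WITH_WARNINGS. Treating a warning result as usable is an assumption based on the name of the code.

```cpp
// Hypothetical helper, not part of the Salience API.
enum class PrepareAction { Proceed, ProceedAndReadWarnings, Abort };

// Maps a preparation return code onto what the caller should do next.
// nOk and nOkWithWarnings are supplied by the caller (in real code:
// LXA_OK and LXA_OK_WITH_WARNINGS from the Salience header).
PrepareAction classifyPrepareResult(int nResult, int nOk, int nOkWithWarnings)
{
    if (nResult == nOk)
        return PrepareAction::Proceed;
    if (nResult == nOkWithWarnings)
        return PrepareAction::ProceedAndReadWarnings;  // assumed still usable
    return PrepareAction::Abort;                       // any other code
}
```

A caller would typically invoke lxaGetLastWarnings when the result is ProceedAndReadWarnings and skip analysis when it is Abort.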

Notes

Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.

Sentences that exceed 1000 words will cause lxaPrepareText to return with LXA_ERROR.
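
If silent truncation matters to you, the word-length limit can be checked before the text is handed to the engine. The sketch below is illustrative only: it assumes words are delimited by ASCII whitespace and that the limit counts characters (UTF-8 code points) rather than bytes, neither of which is confirmed here. The helper names isAsciiSpace, countCodePoints and findOverlongWords are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: words are separated by ASCII whitespace only.
bool isAsciiSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v';
}

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
std::size_t countCodePoints(const std::string& sWord)
{
    std::size_t nCount = 0;
    for (unsigned char c : sWord)
        if ((c & 0xC0) != 0x80)
            ++nCount;
    return nCount;
}

// Returns every whitespace-delimited word longer than nMaxChars characters.
std::vector<std::string> findOverlongWords(const std::string& sText,
                                           std::size_t nMaxChars = 366)
{
    std::vector<std::string> oResult;
    std::size_t i = 0;
    while (i < sText.size()) {
        while (i < sText.size() && isAsciiSpace(sText[i]))
            ++i;                                   // skip separators
        const std::size_t nStart = i;
        while (i < sText.size() && !isAsciiSpace(sText[i]))
            ++i;                                   // consume one word
        if (i > nStart) {
            const std::string sWord = sText.substr(nStart, i - nStart);
            if (countCodePoints(sWord) > nMaxChars)
                oResult.push_back(sWord);
        }
    }
    return oResult;
}
```

An empty result means no word would be truncated under these assumptions.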

Example

const char* acBuffer = "This is some text to process";
...
SalienceSession* pSession;
if(lxaOpenSalienceSession(oLicense, &oStartup, &pSession) != LXA_OK)
     return 1;
// Check the return code rather than discarding it
if(lxaPrepareText(pSession, acBuffer) != LXA_OK)
     return 1;
...

lxaPrepareTextFromFile

Summary

Prepares the text contents of a file for processing. This function, or its sister function lxaPrepareText, must be called every time you want to process a different piece of text. The text must be either 7-bit ASCII or UTF-8.

Syntax

int lxaPrepareTextFromFile(SalienceSession *pSession, const char *acFile);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acFile

Pointer to a null-terminated string containing the path to the file you want to process

Returns

This function returns an integer return code.

LXA_OK

Text preparation was successful

LXA_OK_WITH_WARNINGS

The text is malformed in some way. You should call lxaGetLastWarnings to find out the specifics of the warning.

Notes

Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.

Sentences that exceed 1000 words will cause lxaPrepareTextFromFile to return with LXA_ERROR.
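
Because the file contents must be 7-bit ASCII or UTF-8, it can be useful to validate a file's bytes yourself when its origin is uncertain. The sketch below is one way to do that once the bytes are in memory; isValidUtf8 is a hypothetical helper, not a Salience function, and it says nothing about how the engine itself treats malformed input. Pure 7-bit ASCII passes, since ASCII is a subset of UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true if sText is well-formed UTF-8 (rejects overlong forms,
// surrogate code points and values above U+10FFFF).
bool isValidUtf8(const std::string& sText)
{
    const std::size_t n = sText.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(sText[i]);
        if (c < 0x80) { ++i; continue; }           // 7-bit ASCII

        std::size_t nExtra = 0;                    // continuation bytes expected
        unsigned long nMin = 0;                    // smallest code point for this length
        unsigned long nCode = 0;
        if ((c & 0xE0) == 0xC0)      { nExtra = 1; nMin = 0x80;    nCode = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { nExtra = 2; nMin = 0x800;   nCode = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { nExtra = 3; nMin = 0x10000; nCode = c & 0x07; }
        else return false;                         // stray continuation or bad lead byte

        if (i + nExtra >= n) return false;         // sequence truncated
        for (std::size_t k = 1; k <= nExtra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(sText[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            nCode = (nCode << 6) | (cc & 0x3F);
        }
        if (nCode < nMin) return false;                       // overlong encoding
        if (nCode > 0x10FFFF) return false;                   // beyond Unicode range
        if (nCode >= 0xD800 && nCode <= 0xDFFF) return false; // surrogate
        i += nExtra + 1;
    }
    return true;
}
```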

Example

const char* acPath = "/path/to/content";
...
SalienceSession* pSession;
if(lxaOpenSalienceSession(oLicense, &oStartup, &pSession) != LXA_OK)
     return 1;
// Check the return code rather than discarding it
if(lxaPrepareTextFromFile(pSession, acPath) != LXA_OK)
     return 1;
...

lxaAddSection

Summary

Adds the supplied text into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.

Syntax

int lxaAddSection(SalienceSession *pSession, 
                  const char *acHeader,
                  const char *acText,
                  int nProcess);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acHeader

Pointer to a null-terminated string containing the header for the section

acText

Pointer to a null-terminated string containing the text for the section

nProcess

Returns

This function returns an integer return code.
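
Since the section text must be UTF-8 and not UTF-16, text held as UTF-16 (for example in a std::u16string) has to be converted before it is passed in. The following is a minimal sketch of such a conversion using only the standard library; utf16ToUtf8 is a hypothetical helper, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch rather than anything the Salience documentation prescribes.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-16 code units to UTF-8.
// Unpaired surrogates are replaced with U+FFFD (a choice made for this sketch).
std::string utf16ToUtf8(const std::u16string& sIn)
{
    std::string sOut;
    for (std::size_t i = 0; i < sIn.size(); ++i) {
        char32_t nCode = sIn[i];
        if (nCode >= 0xD800 && nCode <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate if present
            if (i + 1 < sIn.size() && sIn[i + 1] >= 0xDC00 && sIn[i + 1] <= 0xDFFF) {
                nCode = 0x10000 + ((nCode - 0xD800) << 10) + (sIn[i + 1] - 0xDC00);
                ++i;
            } else {
                nCode = 0xFFFD;
            }
        } else if (nCode >= 0xDC00 && nCode <= 0xDFFF) {
            nCode = 0xFFFD;                        // low surrogate with no lead
        }

        if (nCode < 0x80) {
            sOut += static_cast<char>(nCode);
        } else if (nCode < 0x800) {
            sOut += static_cast<char>(0xC0 | (nCode >> 6));
            sOut += static_cast<char>(0x80 | (nCode & 0x3F));
        } else if (nCode < 0x10000) {
            sOut += static_cast<char>(0xE0 | (nCode >> 12));
            sOut += static_cast<char>(0x80 | ((nCode >> 6) & 0x3F));
            sOut += static_cast<char>(0x80 | (nCode & 0x3F));
        } else {
            sOut += static_cast<char>(0xF0 | (nCode >> 18));
            sOut += static_cast<char>(0x80 | ((nCode >> 12) & 0x3F));
            sOut += static_cast<char>(0x80 | ((nCode >> 6) & 0x3F));
            sOut += static_cast<char>(0x80 | (nCode & 0x3F));
        }
    }
    return sOut;
}
```

The c_str() of the returned string can then be supplied as acText or acHeader.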

lxaAddSectionFromFile

Summary

Adds the text from the supplied file into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.

Syntax

int lxaAddSectionFromFile(SalienceSession *pSession, 
                          const char *acHeader,
                          const char *acFile,
                          int nProcess);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acHeader

Pointer to a null-terminated string containing the header for the section

acFile

Pointer to a null-terminated string containing the path to a text file for the section

nProcess

Returns

This function returns an integer return code.

lxaCorrectOCRErrors

Summary

Corrects likely OCR errors found in prepared text. This function is useful when processing scanned documents that have been converted to electronic format through optical character recognition; there is no need to call it if the document originated in an electronic format (word processing, web page, tweet, etc.). If used, it should be called after the text preparation functions (lxaPrepareTextFromFile and lxaPrepareText) but before any text analytics functions (that is, any of the other functions that are typically called after preparing text).

Syntax

int lxaCorrectOCRErrors(SalienceSession *pSession, OCRCharacterAttributeList* pOCRCharacterAttributes, float fConfidenceThreshold, char** acCorrectedText);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

pOCRCharacterAttributes

Pointer to an OCRCharacterAttributeList structure, which identifies suspect characters in the text. Only words containing these characters are subject to correction. If this list is null or empty, all characters in the document text are treated as candidates for correction.

fConfidenceThreshold

Confidence below which a character in pOCRCharacterAttributes will be subject to OCR correction. It is used only when pOCRCharacterAttributes is not null and not empty.

acCorrectedText

Pointer to a char* pointer. The corrected string will be assigned to this char* pointer.

Returns

This function returns an integer return code.

LXA_OK

OCR correction was successful

LXA_API_ABUSE

OCR correction failed because it was called either before text preparation or after calls to text analytics functions. You should call lxaGetLastWarnings to find out the specifics of the error.

Notes

When the argument pOCRCharacterAttributes is null or an empty list, this function checks every character in the document to determine whether it needs to be corrected. However, some OCR engines produce confidence information for characters that they believe might have been misrecognized. If this information is available, you can pass it in the pOCRCharacterAttributes argument, along with a confidence cutoff in fConfidenceThreshold. In this case, only those characters identified in pOCRCharacterAttributes whose confidence values fall below fConfidenceThreshold will be corrected. This speeds up the OCR error correction process.
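
The selection rule described in this note can be modelled in a few lines. The sketch below only illustrates the rule as stated; it is not how the library is implemented, and it works at character level only (it does not model the expansion to whole words). The SuspectChar structure and the selectCorrectionCandidates function are hypothetical and do not correspond to the real OCRCharacterAttribute layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one OCR confidence record (not the real
// OCRCharacterAttribute structure).
struct SuspectChar {
    std::size_t nOffset;     // character offset in the prepared text
    float       fConfidence; // OCR engine's confidence for that character
};

// Returns the offsets that would be candidates for correction:
//  - empty suspect list: every character in the text is a candidate;
//  - otherwise: only suspects whose confidence is below the threshold.
std::vector<std::size_t> selectCorrectionCandidates(std::size_t nTextLength,
                                                    const std::vector<SuspectChar>& oSuspects,
                                                    float fThreshold)
{
    std::vector<std::size_t> oCandidates;
    if (oSuspects.empty()) {
        for (std::size_t i = 0; i < nTextLength; ++i)
            oCandidates.push_back(i);
        return oCandidates;
    }
    for (const SuspectChar& oSuspect : oSuspects)
        if (oSuspect.fConfidence < fThreshold)
            oCandidates.push_back(oSuspect.nOffset);
    return oCandidates;
}
```

With suspects at offsets 8 and 16 and confidences 0.4 and 0.9, a threshold of 0.5 selects only offset 8, while a threshold of 1.0 selects both.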

Example

const char* acBuffer = "This is zome tex1 to process";
...
SalienceSession* pSession;
if(lxaOpenSalienceSession(oLicense, &oStartup, &pSession) != LXA_OK)
     return 1;

// This will correct both "zome" and "tex1"
lxaPrepareText(pSession, acBuffer);
char* acCorrected = nullptr;
lxaCorrectOCRErrors(pSession, nullptr, 0.0f, &acCorrected);

// This will correct only "zome"
lxaPrepareText(pSession, acBuffer);
OCRCharacterAttribute oAttr = {};   // zero-initialize so unset fields are not garbage
oAttr.nCharOffset = 8;              // offset of the 'z' in "zome"
OCRCharacterAttributeList oAttrList;
oAttrList.nAttributes = 1;
oAttrList.pAttributes = &oAttr;
lxaCorrectOCRErrors(pSession, &oAttrList, 1.0f, &acCorrected);
...

lxaPrepareCollection

Summary

Prepares the contents of a [SalienceCollection][4] structure for processing. This function, or its sister function [lxaPrepareCollectionFromFile][3], must be called every time you want to process a different set of related pieces of text. Text within individual members of the collection must be either 7-bit ASCII or UTF-8.

Syntax

int lxaPrepareCollection(SalienceSession *pSession, SalienceCollection *pCollection);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

pCollection

Pointer to a SalienceCollection structure containing the related pieces of text you want to process

Returns

This function returns an integer return code.

LXA_OK

Text preparation was successful

LXA_OK_WITH_WARNINGS

The text is malformed in some way. You should call lxaGetLastWarnings to find out the specifics of the warning.

Example

vector<char*> myVector;
myVector.push_back(acText1);
myVector.push_back(acText2);
...
int nSize = (int)myVector.size();
...
SalienceSession* pSession;
if(lxaOpenSalienceSession(oLicense, &oStartup, &pSession) != LXA_OK)
     return 1;
SalienceCollection oCollection;
oCollection.acCollectionName = "MyCollection";
oCollection.nSize = nSize;
oCollection.pDocuments =
     (SalienceCollectionDocument*)malloc(sizeof(SalienceCollectionDocument) * nSize);

for(int i=0; i < nSize; i++)
{
     oCollection.pDocuments[i].acText = myVector[i];
     oCollection.pDocuments[i].acIndentifier = "myDoc";
     oCollection.pDocuments[i].nIsText = 1;
     oCollection.pDocuments[i].nSplitByLine = 0;
}

// Pass the address of the collection structure itself
if (lxaPrepareCollection(pSession, &oCollection) != LXA_OK)
     return 1;
...

lxaPrepareCollectionFromFile

Summary

Prepares the contents of a file as a collection for processing. This function, or its sister function lxaPrepareCollection, must be called every time you want to process a different set of related pieces of text. The contents of the file are expected to be 7-bit ASCII or UTF-8.

Syntax

int lxaPrepareCollectionFromFile(SalienceSession* pSession, 
                                 const char* acCollectionName, 
                                 const char* acFile);

Parameters

pSession

Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession

acCollectionName

Pointer to a null-terminated string containing a descriptive name for the collection

acFile

Pointer to a null-terminated string containing the path to the file you want to process

Returns

This function returns an integer return code. The most common return codes are shown below; see the [Errors and Warning Codes][6] page for other error return codes.

LXA_OK

Text preparation was successful

LXA_OK_WITH_WARNINGS

The text is malformed in some way. You should call lxaGetLastWarnings to find out the specifics of the warning.

Comments

TODO: insert information about structure of valid file for use with lxaPrepareCollectionFromFile

Example

const char* acPath = "/path/to/content";
...
SalienceSession* pSession;
if(lxaOpenSalienceSession(oLicense, &oStartup, &pSession) != LXA_OK)
     return 1;
SalienceCollection oCollection;
oCollection.nSize = 1;
oCollection.acCollectionName = "MyCollection";

// No semicolon after the condition, so the return runs only on failure
if(lxaPrepareCollectionFromFile(pSession, oCollection.acCollectionName, acPath) != LXA_OK)
     return 1;
...