Text Preparation
API reference for functions that prepare, modify, and apply OCR correction to individual pieces of text and to collections of text.
lxaPrepareText
Summary: Prepares a piece of text for processing. This function, or its sister function lxaPrepareTextFromFile, must be called every time you want to process a different piece of text. Text should either be 7-bit ASCII or UTF-8.
Returns: Integer return code. LXA_OK when text preparation is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. Sentences that exceed 1000 words will cause the function to return LXA_ERROR.
To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Notes: Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word that is not a chemical compound.
Function Signature:
int lxaPrepareText(SalienceSession *pSession, const char *acText);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
acText | Pointer to a char buffer containing the text to be prepared |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *acBuffer = "This is some text to process";
int returnCode = lxaPrepareText(pSession, acBuffer);
if (returnCode == LXA_ERROR) {
    // Preparation failed, for example because a sentence exceeds 1000 words
    return 1;
} else if (returnCode == LXA_OK_WITH_WARNINGS) {
    // Text was prepared but is malformed; see lxaGetLastWarnings
    return 2;
}
// ...
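Because lxaPrepareText expects 7-bit ASCII or UTF-8, it can be useful to validate a buffer before handing it over. The following is a minimal sketch of such a check and is not part of the Salience API: is_valid_utf8 is a hypothetical helper name, and how Salience itself treats invalid byte sequences is not described here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Salience function).
   Returns 1 if the NUL-terminated buffer is well-formed UTF-8, 0 otherwise.
   7-bit ASCII is a subset of UTF-8, so plain ASCII text also passes. */
int is_valid_utf8(const char *text)
{
    assert(text != NULL);
    const unsigned char *p = (const unsigned char *)text;

    while (*p != 0) {
        unsigned char c = *p;
        int extra;          /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { p++; continue; }                 /* ASCII byte */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        p++;
        for (int i = 0; i < extra; i++, p++) {
            /* A NUL byte fails this test too, so a truncated sequence
               never reads past the end of the buffer. */
            if ((*p & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
    }
    return 1;
}
```

A caller might run is_valid_utf8(acBuffer) before lxaPrepareText and re-encode or skip the text when it returns 0.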
lxaPrepareTextFromFile
Summary: Prepares the text contents of a file for processing. This function, or its sister function lxaPrepareText, must be called every time you want to process a different piece of text. Text should either be 7-bit ASCII or UTF-8.
Returns: Integer return code. LXA_OK when text preparation is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. Sentences that exceed 1000 words will cause the function to return LXA_ERROR.
To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Notes: Words that exceed 366 characters in length will be truncated. This is twice the length of the longest English word which is not a chemical compound.
Function Signature:
int lxaPrepareTextFromFile(SalienceSession *pSession, const char *acFile);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
acFile | Pointer to a null-terminated string containing the path to the file you want to process |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *pFile = "path/to/file/goes/here";
int returnCode = lxaPrepareTextFromFile(pSession, pFile);
if (returnCode == LXA_ERROR) {
    // Preparation failed, for example because a sentence exceeds 1000 words
    return 1;
} else if (returnCode == LXA_OK_WITH_WARNINGS) {
    // Text was prepared but is malformed; see lxaGetLastWarnings
    return 2;
}
// ...
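The Notes above state that words longer than 366 characters are truncated. To find out in advance whether a text is affected, a rough pre-check can be sketched as follows. This is an illustration under stated assumptions, not Salience's own logic: count_words_longer_than and MAX_WORD_CHARS are hypothetical names, words are assumed to be whitespace-delimited, and a "character" is assumed to be one UTF-8 code point. Salience's tokenizer may draw word boundaries differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Limit quoted in the Notes above; the macro name is hypothetical. */
#define MAX_WORD_CHARS 366

/* Hypothetical helper (not a Salience function).
   Counts whitespace-delimited words with more than max_chars characters.
   UTF-8 continuation bytes (10xxxxxx) are skipped, so each code point
   counts once regardless of how many bytes it occupies. */
size_t count_words_longer_than(const char *text, size_t max_chars)
{
    assert(text != NULL);
    size_t too_long = 0;
    size_t chars_in_word = 0;

    for (const unsigned char *p = (const unsigned char *)text; ; p++) {
        if (*p == 0 || isspace(*p)) {
            /* End of a word (or of the text): check its length. */
            if (chars_in_word > max_chars) too_long++;
            chars_in_word = 0;
            if (*p == 0) break;
        } else if ((*p & 0xC0) != 0x80) {
            chars_in_word++;   /* first byte of a code point */
        }
    }
    return too_long;
}
```

Calling count_words_longer_than(acText, MAX_WORD_CHARS) on the contents of the file before preparing it gives an estimate of how many words would be affected.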
lxaAddSection
Summary: Adds the supplied text into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16.
Returns: Integer return code.
To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Function Signature:
int lxaAddSection(SalienceSession *pSession, const char *acHeader, const char *acText, int nProcess);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
acHeader | Pointer to a char buffer containing the name of the section the text should be added to |
acText | Pointer to a char buffer containing the text to be added to the section |
nProcess | Integer value that determines whether or not the section is processed. Set this value to 1 to mark it for processing. Set it to 0 so that it is not processed. This is useful if you are using a section for metadata rather than actual text data. |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *headerName = "header-name-of-section";
const char *textData = "People who insist on picking their teeth with their elbows are so annoying!";
// set nProcess argument to 1 to process text
if (lxaAddSection(pSession, headerName, textData, 1) != LXA_OK) {
    return 1;
}
// ...
lxaAddSectionFromFile
Summary: Adds the text from the supplied file into a section of the document for analysis. The supplied text should be UTF-8 encoded, not UTF-16. This is similar to lxaAddSection, but takes its input from a file.
Returns: Integer return code. To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Function Signature:
int lxaAddSectionFromFile(SalienceSession *pSession, const char *acHeader, const char *acFile, int nProcess);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
acHeader | Pointer to a char buffer containing the name of the section the text should be entered into |
acFile | Pointer to a char buffer containing the path to a text file to be added to the section |
nProcess | Integer value that determines whether or not the section is processed. Set this value to 1 to mark it for processing. Set it to 0 so that it is not processed. This is useful if you are using a section for metadata rather than actual text data. |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *headerName = "header-name-of-section";
const char *fileName = "path/to/file";
// set nProcess argument to 1 to process text
if (lxaAddSectionFromFile(pSession, headerName, fileName, 1) != LXA_OK) {
    return 1;
}
// ...
lxaCorrectOCRErrors
Summary: Corrects likely OCR errors found in prepared text. This function is useful when processing scanned documents that have been converted to electronic format through optical character recognition; there is no need to call it if the document originated from an electronic format (word processing, web page, tweet, etc.). If used, it should be called after the text preparation functions (lxaPrepareTextFromFile and lxaPrepareText), but before any text analytics functions (that is, any of the other functions that are typically called after preparing text).
Returns: Integer return code. LXA_OK if OCR correction was successful. LXA_API_ABUSE is returned if this function is called either before the text has been prepared for analysis using lxaPrepareText or after you have performed a text analytics function.
To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Function Signature:
int lxaCorrectOCRErrors(SalienceSession *pSession, OCRCharacterAttributeList *pOCRCharacterAttributes, float fConfidenceThreshold, char **acCorrectedText);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
pOCRCharacterAttributes | Pointer to an OCRCharacterAttributeList structure, which identifies suspect characters in the text. Only words containing these characters are subject to correction. If this list is null or empty, all characters in the document text are treated as candidates for correction |
fConfidenceThreshold | Confidence below which a character in pOCRCharacterAttributes is treated as a candidate for correction |
acCorrectedText | Pointer to a char pointer that receives the buffer containing the corrected text |
Notes: When the argument pOCRCharacterAttributes is null or an empty list, this function will check every character in the document to determine whether it needs to be corrected. However, some OCR engines produce confidence information for characters that they believe might have been misrecognized. If this information is available, you can pass it along in the pOCRCharacterAttributes argument, together with a confidence cutoff in fConfidenceThreshold. In this case, only those characters identified in pOCRCharacterAttributes whose confidence values fall below fConfidenceThreshold will be corrected. This will speed up the OCR error correction process.
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *acBuffer = "This is zome tex1 to process";
// With no attribute list, every character is a candidate:
// this will correct both "zome" and "tex1"
lxaPrepareText(pSession, acBuffer);
char *acCorrected;
lxaCorrectOCRErrors(pSession, nullptr, 0.0f, &acCorrected);
// With an attribute list that flags only offset 8 (the "z" in "zome"),
// this will correct only "zome"
lxaPrepareText(pSession, acBuffer);
OCRCharacterAttribute oAttr = {};  // zero-initialize fields that are not set below
oAttr.nCharOffset = 8;
OCRCharacterAttributeList oAttrList;
oAttrList.nAttributes = 1;
oAttrList.pAttributes = &oAttr;
lxaCorrectOCRErrors(pSession, &oAttrList, 1.0f, &acCorrected);
// ...
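The thresholding rule described in the Notes above (only flagged characters whose confidence falls below the cutoff are corrected) can be illustrated with a small stand-alone sketch. Everything in it is an assumption made for illustration: SuspectChar is a hypothetical stand-in and is not the Salience OCRCharacterAttribute structure, and select_suspects only mimics the selection rule. It does not perform any correction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of OCR engine output.
   This is NOT the Salience OCRCharacterAttribute structure;
   the field names are assumptions for this sketch. */
typedef struct {
    int   offset;      /* character offset into the text */
    float confidence;  /* engine's confidence in that character */
} SuspectChar;

/* Hypothetical helper (not a Salience function).
   Copies the offsets of all entries whose confidence is strictly below
   threshold into out, writing at most out_cap entries.
   Returns the number of offsets written. */
size_t select_suspects(const SuspectChar *chars, size_t n, float threshold,
                       int *out, size_t out_cap)
{
    assert(chars != NULL || n == 0);
    assert(out != NULL || out_cap == 0);

    size_t written = 0;
    for (size_t i = 0; i < n && written < out_cap; i++) {
        if (chars[i].confidence < threshold) {
            out[written++] = chars[i].offset;
        }
    }
    return written;
}
```

For the text "This is zome tex1 to process", entries at offset 8 (the "z") with confidence 0.40 and offset 16 (the "1") with confidence 0.95, combined with a threshold of 0.5, would select only offset 8.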
lxaPrepareCollection
Summary: Prepares the contents of a SalienceCollection structure for processing. This function, or its sister function lxaPrepareCollectionFromFile, must be called every time you want to process a different set of related pieces of text. Text within individual members of the collection should either be 7-bit ASCII or UTF-8.
Returns: Integer return code. LXA_OK when text preparation for the collection is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. Sentences that exceed 1000 words will cause the function to return LXA_ERROR.
To retrieve more information on warnings and errors use lxaGetLastWarnings on the SalienceSession structure.
Function Signature:
int lxaPrepareCollection(SalienceSession *pSession, SalienceCollection *pCollection);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession. |
pCollection | Pointer to a SalienceCollection structure containing the related pieces of text you want to process. |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense, and that acText1 and acText2 are char buffers
holding the texts to group into the collection */
vector<char*> myVector(0);
myVector.push_back(acText1);
myVector.push_back(acText2);
int nSize = myVector.size();
SalienceCollection oCollection;
oCollection.acCollectionName = "MyCollection";
oCollection.nSize = nSize;
oCollection.pDocuments =
(SalienceCollectionDocument*)malloc(sizeof(SalienceCollectionDocument) * nSize);
for(int i=0; i < nSize; i++) {
oCollection.pDocuments[i].acText = myVector[i];
oCollection.pDocuments[i].acIndentifier = "myDoc";
oCollection.pDocuments[i].nIsText = 1;
oCollection.pDocuments[i].nSplitByLine = 0;
}
if (lxaPrepareCollection(pSession, &oCollection) != LXA_OK) {
return 1;
}
// ...
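Since a sentence of more than 1000 words causes preparation to return LXA_ERROR, it may be worth screening documents before adding them to a collection. The sketch below is only an approximation under stated assumptions: longest_sentence_words and MAX_SENTENCE_WORDS are hypothetical names, sentences are assumed to end at '.', '!' or '?', and words are assumed to be whitespace-delimited. Salience's own sentence and word detection is not described here and will differ in edge cases such as abbreviations and decimal numbers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Limit quoted in the Returns text above; the macro name is hypothetical. */
#define MAX_SENTENCE_WORDS 1000

/* Hypothetical helper (not a Salience function).
   Returns the word count of the longest sentence in text, using a naive
   rule: '.', '!' and '?' end a sentence, whitespace separates words. */
size_t longest_sentence_words(const char *text)
{
    assert(text != NULL);
    size_t longest = 0;
    size_t words = 0;   /* words seen so far in the current sentence */
    int in_word = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p != 0; p++) {
        if (isspace(*p)) {
            in_word = 0;
        } else if (*p == '.' || *p == '!' || *p == '?') {
            /* Sentence terminator: close the current sentence. */
            if (words > longest) longest = words;
            words = 0;
            in_word = 0;
        } else if (!in_word) {
            words++;        /* first character of a new word */
            in_word = 1;
        }
    }
    if (words > longest) longest = words;  /* text without a final terminator */
    return longest;
}
```

A document for which longest_sentence_words(acText) exceeds MAX_SENTENCE_WORDS could then be split or skipped before the collection is prepared.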
lxaPrepareCollectionFromFile
Summary: Prepares the contents of a file as a collection for processing. This function, or its sister function lxaPrepareCollection, must be called every time you want to process a different set of related pieces of text. The contents of the file are expected to be 7-bit ASCII or UTF-8.
Returns: Integer return code. LXA_OK when text preparation for the collection is successful and LXA_OK_WITH_WARNINGS when the text is malformed in some way. Sentences that exceed 1000 words will cause the function to return LXA_ERROR.
To retrieve more information on the return code use lxaGetLastWarnings on the SalienceSession structure.
More information on errors and warnings can be found in the Errors and Warning Codes section of our documentation.
Function Signature:
int lxaPrepareCollectionFromFile(SalienceSession *pSession, const char *acCollectionName, const char *acFile);
Parameter | Description |
---|---|
pSession | Pointer to a SalienceSession structure previously returned by a call to lxaOpenSalienceSession |
acCollectionName | Pointer to a null-terminated string containing a descriptive name for the collection |
acFile | Pointer to a null-terminated string containing the path to the file you want to process |
Example:
Use lxaLoadLicense to create a LexalyticsLicense structure, then pass it to lxaOpenSalienceSession to start a session with Salience.
/* Assuming a SalienceSession (*pSession) has already been opened with
a valid LexalyticsLicense */
const char *acPath = "/path/to/content";
const char *acCollectionName = "MyCollection";
if (lxaPrepareCollectionFromFile(pSession, acCollectionName, acPath) != LXA_OK) {
    return 1;
}
// ...