This release is a major update to Salience, introducing new features and enhancing existing ones. Salience 6 introduces the ability to identify “intentions” within content, aimed at improving the use of Salience against customer content. Salience 6 also improves sentiment analysis in cases of mixed sentiment and in content containing lists and tables, provides more powerful query topic functionality, and adds the ability to analyze content through multiple user directories containing different customizations in one pass. Another important change in Salience 6 is a change to the default values for two existing options. Further details on all features and enhancements are provided below.
- New: Intentions
- New: Multi-configuration support
- New: DOCX ingest
- New: Classification model support
- New: Alternate forms
- Sentiment enhancements
- Query enhancements
- Entity extraction enhancements
- Important: Changes to default option values
- Changes to API methods and result structures
Our benchmark for performance has always been 2 documents per second per server core, where a document is up to 4 KB of paragraph text, with processing for each document consisting of text preparation, entity extraction, sentiment extraction, theme extraction, and summarization with our default processing settings.
In preparation for release, Salience 6 was timed against an internal benchmark content set of news articles. Average processing time across this content set is 274 ms per document, which meets our expected benchmark. This is a 5% increase in processing time over Salience 5.2, which was clocked at 260 ms on the same API calls on the same content set.
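The figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the benchmark of 2 documents per second per core translates to a budget of 500 ms per document, and the variable names are illustrative.

```python
# Figures quoted in the release notes.
BENCHMARK_DOCS_PER_SEC = 2   # target throughput per server core
SALIENCE_6_MS = 274          # average ms per document, Salience 6
SALIENCE_5_2_MS = 260        # average ms per document, Salience 5.2

# Assumption: the throughput target implies a per-document time budget.
budget_ms = 1000 / BENCHMARK_DOCS_PER_SEC
throughput_6 = 1000 / SALIENCE_6_MS
increase_pct = (SALIENCE_6_MS - SALIENCE_5_2_MS) / SALIENCE_5_2_MS * 100

meets_benchmark = SALIENCE_6_MS <= budget_ms
print(f"budget: {budget_ms:.0f} ms, throughput: {throughput_6:.2f} docs/s")
print(f"increase over 5.2: {increase_pct:.1f}%, meets benchmark: {meets_benchmark}")
```

On these numbers the per-document time grows by a little over 5% while staying well inside the 500 ms budget.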
Performance impacts due to other new features, such as the query enhancements and the WITH operator, depend on the complexity of the query definitions and the number of hits the queries generate against content. In general, the WITH operator has the same performance profile as the NEAR operator.
Salience 6 introduces the identification of intentions within document content. In the initial release, Salience 6 will recognize buy, sell, quit, or recommend intentions. As with other features of the product, intention recognition can be tuned to a specific content set or extended through customization.
IMPORTANT: Access to intention functionality is controlled via the license file. Although Salience 5.x license files can be used with Salience 6, existing customers will need to contact Lexalytics for updated license files in order to access intention functionality.
Salience 6 provides support for setting up multiple "configurations" for content processing in the same Salience session. There are two primary use cases we aimed at supporting with this new capability:
- Cases where customers wish to run the same content through multiple customization sets, without swapping user directories constantly and having to reprocess the content, or
- Cases where customers may direct content to be processed by a specific customization set, but need multiple customization sets to be active and available to service the needs of their content set.
With Salience 6, a "configuration" can be added to a Salience session, which consists of a user directory containing customized data files. Content is processed once, and results can be retrieved for one or more specific configurations by name. Most options, such as query topic files, entity thresholds, and concept topic thresholds, can also be configuration-specific.
NOTE: Certain options cannot be set on a per-configuration basis. These are the base options that control the language, text threshold, use of shared memory, processing content as a single sentence, flattening upper-case content, etc.
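The "process once, retrieve per configuration" idea can be sketched with a toy model. Everything in this example is hypothetical: `ToySession`, `add_configuration`, `prepare_text`, and `get_topic_hits` are illustrative names and do not reflect the actual Salience API, and a set of topic terms stands in for a user directory.

```python
# Toy model only: ToySession and its methods are hypothetical names,
# not the Salience API.
class ToySession:
    def __init__(self):
        self._configs = {}   # configuration name -> set of custom topic terms
        self._tokens = []

    def add_configuration(self, name, topic_terms):
        # A set of terms stands in for a user directory of customized data files.
        self._configs[name] = {t.lower() for t in topic_terms}

    def prepare_text(self, text):
        # Content is tokenized once, however many configurations exist.
        self._tokens = [w.strip(".,!?").lower() for w in text.split()]

    def get_topic_hits(self, config_name):
        # Results are retrieved per configuration, by name, without reprocessing.
        terms = self._configs[config_name]
        return sorted({t for t in self._tokens if t in terms})


session = ToySession()
session.add_configuration("retail", ["price", "store"])
session.add_configuration("support", ["refund", "service"])
session.prepare_text("The store gave me a refund, but the service was slow.")
retail_hits = session.get_topic_hits("retail")
support_hits = session.get_topic_hits("support")
```

The point of the design is that the expensive step (text preparation) happens once, while each configuration only changes how the prepared content is interpreted.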
Salience 6 supports directly ingesting the Microsoft Word .docx format, one of the most common file formats in the business world. Salience 6 treats this format as native content for text analytics, maintaining information about document sections and formatting for future use.
In addition to the document classification possible through query-based topics, concept-based topics, and auto-categorization, Salience 6 supports the use of generic classification models. The initial release of Salience 6 ships with an example Naïve-Bayes model for classifying document content. An update to the Salience Workbench will be provided that contains tools for customers to create their own classification models. Currently, only Naïve-Bayes is supported; additional model types will be added in the future.
As Salience has been used to analyze more and more non-journalistic content, we have introduced techniques to accommodate human error in content. Note, this is not spell-correction! Salience does not auto-correct words that are detected as possible misspellings, slang, or jargon. Instead, Salience 6 includes a datafile that provides alternate forms for these words, allowing the core Salience text analytics processes to operate on the alternate forms for better POS tagging, and thus better higher-order results.
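The distinction between alternate forms and spell-correction can be illustrated with a small sketch. The table contents and the function name below are assumptions made for illustration; the format of the actual datafile is not described in these notes.

```python
# Hypothetical alternate-forms table; the real datafile format is not shown here.
ALTERNATE_FORMS = {"gr8": "great", "luv": "love", "u": "you"}


def apply_alternate_forms(tokens):
    """Pair each original token with the form used for analysis.

    The original token is kept, so the source text is never rewritten.
    """
    return [(tok, ALTERNATE_FORMS.get(tok.lower(), tok)) for tok in tokens]


pairs = apply_alternate_forms("I luv this phone".split())
for original, analyzed in pairs:
    print(original, "->", analyzed)
```

Downstream steps such as POS tagging would see "love" while the output still reports the original "luv".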
Sentiment is often expressed within content as a mixed bag of positive and negative statements. For example,
The restaurant was in a great location and the food was good, but the service made the evening a disaster.
Today was suppose to be a great day but I guess not #disappointed
Although the content initially states positive sentiment, it is the negative sentiment at the end of the statement that holds more weight and conveys the true message. Salience 6 includes new logic for handling content similar to these examples, where sentiment shifts within the same sentence without a direct negation. We call the words that signal these shifts discourse modifiers. The Salience 6 distribution contains a default data file of discourse modifiers and the weighting that should be applied when their use is detected in connection with sentiment shifts.
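A minimal sketch of how a discourse modifier might re-weight clause sentiment follows. The weights, the scores, and the function name are invented for illustration; the actual data file format and weighting scheme used by Salience are not described here.

```python
# Hypothetical weights: (weight for the clause before, weight for the clause after).
DISCOURSE_MODIFIERS = {"but": (0.5, 1.5)}


def weighted_sentiment(before_score, connector, after_score):
    """Combine two clause scores, shifting weight toward the later clause
    when the connector is a known discourse modifier."""
    w_before, w_after = DISCOURSE_MODIFIERS.get(connector, (1.0, 1.0))
    return w_before * before_score + w_after * after_score


# Illustrative scores: positive opening clause (+0.5), negative closing clause (-0.5).
plain = weighted_sentiment(0.5, "and", -0.5)    # equal weights: clauses cancel out
shifted = weighted_sentiment(0.5, "but", -0.5)  # modifier: later clause dominates
print(plain, shifted)
```

With equal weights the restaurant example would score as neutral; with the modifier applied, the negative closing clause decides the overall score.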
Additionally, Salience 6 includes the handling of sentiment expressed in lists and tables. Previous Salience releases allowed application developers to detect lists and tables, and turn off sentiment analysis of content contained within these sections to avoid skewed results. With Salience 6, table and list headers are used as cues to the sentiment for the content contained within these structural elements.
The enhancements to sentiment analysis in Salience 6 will show a difference in content that has a mix of sentiment. Previously, positive and negative sentiment within the same sentence or tweet would cancel each other out, whereas human judgment weights sentiment phrases differently based on where they occur. In an internal test of short-form content, we observed a 3-point increase in F1 for positive and negative sentiment over Salience 5.2. The increases in precision and recall were directly related to cases that Salience 5.2 considered neutral, but where Salience 6 is able to match the human expectation of non-neutral sentiment.
Extensions have been made to the query syntax used within Salience for query-defined topics to allow the existence of entities, document-level sentiment, or entity-level sentiment to be used as query criteria.
Salience 6 also introduces a new operator to the query syntax for query-defined topics, the WITH operator. This operator requires that the query terms it joins appear within the same sentence in the content.
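The same-sentence requirement can be sketched as follows. This is not the Salience query engine: `with_match` is a hypothetical helper, and the sentence splitting is deliberately naive (it breaks on `.`, `!`, and `?`).

```python
import re


def with_match(text, term_a, term_b):
    """Return True if both terms occur within the same sentence."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"\w+", sentence.lower()))
        if term_a.lower() in words and term_b.lower() in words:
            return True
    return False


doc = "The battery is excellent. The screen cracked within a week."
same_sentence = with_match(doc, "screen", "cracked")
across_sentences = with_match(doc, "battery", "cracked")
print(same_sentence, across_sentences)
```

Both term pairs occur in the document, but only the pair that shares a sentence satisfies the same-sentence condition.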
The Salience entity extraction model supports the extraction of People, Places, Companies, and Products. In addition, List and Pattern type entities are supported in the out-of-the-box data files. What Salience 6 adds is the ability for customers to define their own entity types simply by adding a folder named with the desired entity type to a user directory.
Company extraction is very important to our customers, but this extends beyond strictly corporate entities to organizations, associations, political parties, and government agencies. The out-of-the-box datafiles have been updated to aid in the extraction of these company-like entities.
The enhancements to entity extraction are not expected to result in large improvements in overall entity extraction F1. What customers can expect is increased extraction of company-like entities that Salience did not previously consider company entities. Similarly, the ability to define new entity types will not show increases in entity extraction out-of-the-box, but provides customers with increased flexibility in their customizations.
Changes have been made to the default values for the following options:
Entity Options – Entity Topics
The default value for this option is now 0 (off) for better out-of-the-box performance on entity extraction. Set this option to 1 (on) to obtain corresponding topics in entity extraction results.
Theme Options – Theme Topics
The default value for this option is now 0 (off) for better out-of-the-box performance on theme extraction. Set this option to 1 (on) to obtain corresponding topics in theme extraction results.
As with every major Salience release, there are some changes to API methods and result structures.
The majority of the changes in the Salience 6 API are backward-compatible. Changes to API methods to enable the use of multiple configurations have been implemented in the wrappers through method overloading and, where applicable, default method parameters. Thus, code written against the Salience 5.2 API for an operation such as getting document sentiment will function in the same manner against the Salience 6 API, returning document sentiment for the default configuration. Code changes are needed only to take advantage of the new multiple configuration capability, by calling the new overloaded API method that takes a configuration name as an extra parameter.
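The backward-compatibility pattern described above can be sketched with a default method parameter. The class, method, and configuration names here are hypothetical and the sentiment values are made up; consult the migration guide for the actual wrapper signatures.

```python
# Hypothetical wrapper: names and signatures are illustrative,
# not the Salience 6 API.
DEFAULT_CONFIG = "default"


class ToyWrapper:
    def __init__(self):
        # Made-up per-configuration results for a single processed document.
        self._sentiment = {DEFAULT_CONFIG: 0.2, "finance": -0.4}

    def get_document_sentiment(self, config_name=DEFAULT_CONFIG):
        # Existing callers pass no argument and get the default configuration.
        return self._sentiment[config_name]


wrapper = ToyWrapper()
legacy_call = wrapper.get_document_sentiment()         # 5.2-style call, unchanged
new_call = wrapper.get_document_sentiment("finance")   # opts in to a named configuration
print(legacy_call, new_call)
```

Because the extra parameter is optional, existing call sites keep compiling and keep their behavior, and only code that wants a named configuration needs to change.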
There are changes to some return structures in Salience 6 that will require changes in code written against the Salience 5.x API.
A migration guide is available to document the changes to API methods and result structures.