.../data/salience/entities
The entities directory contains a set of subfolders that drive entity extraction for the available entity types. To customize entity extraction, override existing data files or create additional ones in the equivalent locations in a user directory.
- salience/entities subfolders
- Common data files
- Customizing entity extraction
salience/entities subfolders
<Lexalytics root>/data/salience/entities
companies
: Entities that represent businesses
lists
: Simple list-based entities
patterns
: Entities identified by recognizable patterns (e.g. phone numbers)
people
: Entities that represent people
places
: Entities that represent geographical locations
products
: Entities that represent products
queries
: Query-defined entities
regex
: Entities extracted via regular expressions
Common data files
Several data files appear in many of these folders and control how entities are extracted. These can be added to the equivalent subfolders within a user directory.
| Data file | Purpose |
|---|---|
| | Data file for excluding certain text from entity recognition |
| normalization.dat | Used for normalizing the multiple forms in which an entity can occur to a single form |
| | Pattern rules for configuring entity extraction |
| | Additional configuration for entity extraction |
| | Customer-defined list files for specific entities to be extracted |
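Placing a data file in the equivalent subfolder of a user directory could be sketched as follows. This is a minimal illustration, not a Lexalytics tool: the function name `mirror_into_user_dir` and its parameters are hypothetical, and it assumes that "equivalent location" means the same relative path under the user directory as under the default data directory.

```python
from pathlib import Path
import shutil


def mirror_into_user_dir(data_root: Path, user_root: Path, data_file: Path) -> Path:
    """Copy data_file to the same relative location under user_root.

    Assumption: the user directory mirrors the layout of the default
    data directory, so only the root part of the path changes.
    """
    # Path of the file relative to the default data directory,
    # e.g. salience/entities/companies/<some file>
    relative = data_file.relative_to(data_root)
    target = user_root / relative
    # Create the equivalent subfolder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, target)
    return target
```

The copied file can then be edited in place without touching the default data directory.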
Customizing entity extraction
1: Simple augmentation through CDL files
By default, entities are identified and typed by the entity extraction model, which has been trained on thousands of examples of company names, person names, etc. in context. However, if specific entities of a certain type are known to occur in your content but are not being extracted by the model, they can be added to a customer-defined list (CDL) file. These are discussed in detail on the page: Customer-Defined Lists (CDL).
2: Normalizing different forms of an entity
In some cases, often with companies, an entity may be referred to in different forms (e.g. Microsoft, Microsoft Corp., Microsoft Corporation). Multiple forms can be normalized into a single output form in results through the use of a normalization.dat file. This is a simple two-column, tab-delimited file with the various forms in the first column and the official form in the second column. An example is provided below:
Microsoft<tab>Microsoft Corp.
Microsoft Corp<tab>Microsoft Corp.
Microsoft Corporation<tab>Microsoft Corp.
NOTE: If you copy a normalization.dat file into a user directory, you must also copy a score.dat file.
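To make the two-column format concrete, the sketch below parses tab-delimited rows of the kind described above and applies them to entity strings. It only illustrates the file format; it does not show how Salience itself loads or applies normalization.dat, and the names `load_normalization`, `normalize`, and `SAMPLE` are hypothetical.

```python
def load_normalization(text: str) -> dict[str, str]:
    """Parse tab-delimited rows of the form: variant<TAB>official form."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        variant, official = parts[0].strip(), parts[1].strip()
        if variant and official:
            mapping[variant] = official
    return mapping


def normalize(entity: str, mapping: dict[str, str]) -> str:
    """Return the official form, or the entity unchanged if it is unlisted."""
    return mapping.get(entity, entity)


# Sample rows in the same format, with a real tab character as the separator.
SAMPLE = "Microsoft\tMicrosoft Corp.\nMicrosoft Corp\tMicrosoft Corp.\n"

mapping = load_normalization(SAMPLE)
print(normalize("Microsoft", mapping))       # Microsoft Corp.
print(normalize("Microsoft Corp", mapping))  # Microsoft Corp.
print(normalize("Apple", mapping))           # Apple
```

Every variant in the first column resolves to the same official form, which is what collapses several surface forms into one entity in the output.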
3: Pattern-based entity extraction
Certain entity type folders contain additional pattern files that are used to perform complex pattern-based entity recognition. Additional pattern files can be added to a user directory; their contents must follow the pattern syntax defined on the page: Pattern Files.