.../data/salience/entities

The entities directory contains subfolders that drive extraction of the available entity types. To customize entity extraction, override the existing data files or create additional ones in the equivalent locations in a user directory.

  • salience/entities subfolders
  • Common data files
  • Customizing entity extraction

salience/entities subfolders

<Lexalytics root>/data/salience/entities

companies: Entities that represent businesses

lists: Simple list-based entities

patterns: Entities identified by recognizable patterns (e.g. phone numbers)

people: Entities that represent people

places: Entities that represent geographical locations

products: Entities that represent products

queries: Query-defined entities

regex: Entities extracted via regular expressions

Common data files

Several data files that control entity extraction appear in many of these folders. Copies of them, or additional files, can be placed in the equivalent subfolders within a user directory.

exclude.dat

Data file for excluding certain text from entity recognition
NOTE: if you create a copy of this data file in a user directory, you must also copy score.dat

normalization.dat

Used for normalizing the multiple forms in which an entity can occur to a single form
NOTE: if you create a copy of this data file in a user directory, you must also copy score.dat

rules.ptn

Pattern rules for configuring entity extraction

score.dat

Additional configuration for entity extraction

*.cdl

Customer-defined list files for specific entities to be extracted
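
The copy requirement noted above for exclude.dat and normalization.dat can be scripted. The following is a minimal sketch, not part of the product: the function name copy_override is hypothetical, and it assumes that the user directory mirrors the default data directory, so that salience/entities/<subfolder> exists under both roots.

```python
import shutil
from pathlib import Path
from typing import List

# Per the notes above, a user-directory copy of either of these files
# must be accompanied by a copy of score.dat.
NEEDS_SCORE = {"exclude.dat", "normalization.dat"}


def copy_override(default_data: Path, user_dir: Path,
                  subfolder: str, filename: str) -> List[Path]:
    """Copy a default data file into the equivalent user-directory location.

    Assumption: both roots contain salience/entities/<subfolder>.
    Files already present in the user directory are left untouched.
    """
    src_dir = default_data / "salience" / "entities" / subfolder
    dst_dir = user_dir / "salience" / "entities" / subfolder
    dst_dir.mkdir(parents=True, exist_ok=True)

    names = [filename]
    if filename in NEEDS_SCORE:
        names.append("score.dat")

    placed = []
    for name in names:
        dst = dst_dir / name
        if not dst.exists():
            shutil.copy2(src_dir / name, dst)  # do not overwrite user edits
        placed.append(dst)
    return placed
```

For example, copy_override(Path("<Lexalytics root>/data"), Path("<user directory>"), "companies", "normalization.dat") would place both normalization.dat and score.dat in the companies subfolder of the user directory, with the two placeholder paths replaced by real locations.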

Customizing entity extraction

1: Simple augmentation through CDL files
By default, entities are identified and typed by the entity extraction model, which has been trained on thousands of examples of company names, person names, etc. in context. However, if specific entities of a certain type are known to exist in the content but are not being extracted by the model, they can be added to a customer-defined list (CDL) file. CDL files are discussed in detail on the page: Customer-Defined Lists (CDL).

2: Normalizing different forms of an entity
In some cases, often with companies, an entity may be referred to in different forms (e.g. Microsoft, Microsoft Corp., Microsoft Corporation). Multiple forms can be normalized to a single output form in results through the use of a normalization.dat file. This is a simple two-column, tab-delimited file with a variant form in the first column and the official form in the second column. An example is provided below:

Microsoft<tab>Microsoft Corp.
Microsoft Corp<tab>Microsoft Corp.
Microsoft Corporation<tab>Microsoft Corp.

NOTE: As stated above, if you copy a normalization.dat file into a user directory, you must also copy a score.dat file.
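
To check a normalization.dat file before deploying it, the two-column layout described above can be read with a few lines of Python. This is an illustrative sketch only: load_normalization and normalize are hypothetical helpers, not part of the product, and the sketch assumes that the variant form is in the first column, the official form is in the second, and columns are separated by a literal tab character.

```python
from typing import Dict


def load_normalization(text: str) -> Dict[str, str]:
    """Parse tab-delimited rows of the form: variant<TAB>official form."""
    mapping = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(columns)}")
        variant, official = (c.strip() for c in columns)
        mapping[variant] = official
    return mapping


def normalize(name: str, mapping: Dict[str, str]) -> str:
    """Return the official form for a name, or the name itself if unmapped."""
    return mapping.get(name, name)


# Sample rows; "\t" stands for the <tab> separator shown in the example.
sample = "Microsoft\tMicrosoft Corp.\nMicrosoft Corp\tMicrosoft Corp.\n"
mapping = load_normalization(sample)
print(normalize("Microsoft", mapping))       # Microsoft Corp.
print(normalize("Microsoft Corp", mapping))  # Microsoft Corp.
```

A row with a missing or extra tab raises ValueError with its line number, which makes a malformed row easy to locate before the file is placed in a user directory.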

3: Pattern-based entity extraction
Certain entity type folders contain additional pattern files that are used to perform complex pattern-based entity recognition. Additional pattern files can be added to a user directory; their contents are instructions written in the pattern syntax defined on the page: Pattern Files.