.../data/tagger
If you are comfortable with regular expressions and the concepts behind a part-of-speech (POS) tagger, you can modify the basic part-of-speech tagging to handle complex text strings such as hyphenated product names, dates, and other special text strings.
Each file is described in more detail below.
| File | Description |
|---|---|
| customlexicon.dat | A list of words with known POS tags |
| gluepatterns.ptn | A pattern file applied during tokenization |
| gluetokens.dat | A case-sensitive list of words that should not be split into individual tokens |
| uri_suffixes.dat | A list of extensions used in file and website URIs (uniform resource identifiers) |
Refer to the List of supported POS tags for information about the part-of-speech tags used within the data files for Salience Engine.
customlexicon.dat
This file allows you to override the POS tagging for individual words. It uses case-sensitive exact matching. Each entry is of the form:
<word><tab><POS tag>
For example:
eBay NNP
iPhone NNP
www.foo.com URL
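To illustrate the entry format, the following is a minimal Python sketch that reads `<word><tab><POS tag>` lines into a case-sensitive lookup table. It is not part of Salience Engine; the function name `parse_custom_lexicon` is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
def parse_custom_lexicon(text):
    """Parse <word><tab><POS tag> lines into a case-sensitive dict.

    Hypothetical helper for illustration only; not a Salience Engine API.
    """
    lexicon = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0] or not parts[1].strip():
            raise ValueError(f"line {line_no}: expected <word><tab><POS tag>")
        word, tag = parts[0], parts[1].strip()
        lexicon[word] = tag  # exact, case-sensitive key
    return lexicon


sample = "eBay\tNNP\niPhone\tNNP\nwww.foo.com\tURL\n"
lexicon = parse_custom_lexicon(sample)
print(lexicon["eBay"])      # NNP
print(lexicon.get("ebay"))  # None, because matching is case-sensitive
```

A check like this can be handy for validating a lexicon file before deploying it, since an entry separated by spaces instead of a tab would be rejected.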
gluepatterns.ptn
The glue file lets you glue together items that the tokenizer splits apart, overriding their POS tag at the same time. The syntax of the glue file is as follows:
{pattern}=>{LABEL}
For example, to recognize the smiley face :-), you would build a glue pattern like the following:
(=':') (='-') (=')')=>SMILEY
A glue statement like the one above produces a POS-tagged document with a special marker for smiley faces. Given the sentence:
"Bob Smith is one happy camper :-)"
You get the following tagged text:
Bob/NNP Smith/NNP is/VBZ one/CD happy/JJ camper/NN :-)/SMILEY ./.
The same concept can be applied to any specialty text that you want to tag, such as dates or product names.
Note: If no tag (like SMILEY) is given to the string then it will be tagged using the normal tagging rules (lexicon lookup falling back on a maximum entropy model).
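The gluing idea can be sketched in Python as follows. This is a simplified illustration, not the Salience Engine implementation: it only understands literal elements of the form `(='x')`, the helper names `parse_glue_pattern` and `apply_glue` are hypothetical, and it assumes the tokenizer has already split `:-)` into three tokens. A label of `None` stands for "no tag given, leave it to the normal tagger".

```python
import re


def parse_glue_pattern(line):
    """Split '{pattern}=>{LABEL}' into (literal tokens, label or None).

    Hypothetical helper; only literal (='x') elements are recognized.
    """
    pattern, sep, label = line.rpartition("=>")
    if not sep:
        # assumption: a line without '=>' is a pattern with no label
        pattern, label = line, ""
    literals = re.findall(r"\(='([^']*)'\)", pattern)
    if not literals:
        raise ValueError("no (='...') elements found in pattern")
    return literals, (label.strip() or None)


def apply_glue(tokens, literals, label):
    """Merge each run of tokens equal to `literals` into one labeled token."""
    out = []
    i = 0
    n = len(literals)
    while i < len(tokens):
        if tokens[i:i + n] == literals:
            out.append(("".join(literals), label))
            i += n
        else:
            out.append((tokens[i], None))  # None: tagged later by normal rules
            i += 1
    return out


literals, label = parse_glue_pattern("(=':') (='-') (=')')=>SMILEY")
tokens = ["Bob", "Smith", "is", "one", "happy", "camper", ":", "-", ")"]
glued = apply_glue(tokens, literals, label)
print(glued[-1])  # (':-)', 'SMILEY')
```

The sketch does plain sequence matching; real glue patterns may support richer elements than literals, which are outside what this example covers.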
gluetokens.dat
This is a case-sensitive list of words that should not be split up by the tokenizer when they appear contiguously in the text, which can also impact how words are tagged. For example:
x-box
x-ray
P&G
This prevents the string x-box from being tokenized as "x", "-", "box", producing a single token instead.
You can make an entry case-insensitive by prefixing it with a ~, so ~p&g would make all occurrences of P&G be tokenized as a single token regardless of case.
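The difference between plain and `~`-prefixed entries can be sketched in Python like this. The helpers `load_glue_tokens` and `stays_whole` are hypothetical names for illustration and do not reflect how Salience Engine stores or matches these entries internally.

```python
def load_glue_tokens(lines):
    """Split entries into case-sensitive and case-insensitive sets.

    Hypothetical helper; entries starting with '~' match regardless of case.
    """
    exact = set()
    folded = set()
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # assumption: blank lines are ignored
        if entry.startswith("~"):
            folded.add(entry[1:].lower())
        else:
            exact.add(entry)
    return exact, folded


def stays_whole(text, exact, folded):
    """Return True if `text` should be kept as a single token."""
    return text in exact or text.lower() in folded


exact, folded = load_glue_tokens(["x-box", "x-ray", "~p&g"])
print(stays_whole("x-box", exact, folded))  # True
print(stays_whole("X-Box", exact, folded))  # False, entry is case-sensitive
print(stays_whole("P&G", exact, folded))    # True, '~' entry ignores case
```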
uri_suffixes.dat
This data file provides a list of common Internet domain suffixes and file format extensions.