The themes directory contains the files that control theme extraction at all levels: documents, entities, and collections.
Users may customize the following files in the themes section of a user directory. Each file is described in more detail below.
| File | Description |
|---|---|
| rules.ptn | Part-of-speech patterns that define theme extraction |
| stopwords.dat | A list of words that should not be considered for themes |
| normalization.dat | Rules for relating themes together |
rules.ptn
This file controls the part-of-speech (POS) rules that determine whether a combination of words is a theme. It uses the Pattern File format.
stopwords.dat
This file is used to eliminate phrases that would match the POS rules contained within rules.ptn but are too common to be considered useful, such as "last week".
The file is a single-column .dat file. It can contain both single words and multi-word phrases.
Single words will act as a stop on any phrase containing them:
hello
will stop any phrase that contains the word "hello".
Phrases will act as a stop on that particular phrase:
next week
will stop "next week", but it will not stop "sometimes next week".
NOTE: stopwords.dat is case-insensitive.
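The matching behavior described above can be illustrated with a small Python sketch. This is not Salience code: the function names `load_stopwords` and `is_stopped` are hypothetical, and whitespace tokenization is an assumption made only for illustration. It shows a single-word entry stopping any phrase that contains it, a phrase entry stopping only that exact phrase, and matching that ignores case.

```python
def load_stopwords(lines):
    """Split entries into single-word stops and phrase stops (hypothetical helper)."""
    words, phrases = set(), set()
    for line in lines:
        # Lowercase and collapse whitespace so matching is case-insensitive.
        entry = " ".join(line.lower().split())
        if not entry:
            continue
        if " " in entry:
            phrases.add(entry)
        else:
            words.add(entry)
    return words, phrases


def is_stopped(candidate, words, phrases):
    """Return True if a candidate theme would be stopped (illustrative only)."""
    # Assumption: tokens are separated by whitespace.
    tokens = candidate.lower().split()
    # A phrase entry stops only that exact phrase.
    if " ".join(tokens) in phrases:
        return True
    # A single-word entry stops any phrase containing that word.
    return any(token in words for token in tokens)


words, phrases = load_stopwords(["hello", "next week"])
print(is_stopped("Hello world", words, phrases))
print(is_stopped("next week", words, phrases))
print(is_stopped("sometimes next week", words, phrases))
```

Under these assumptions, the first two candidates are stopped and the third is not, which mirrors the "hello" and "next week" examples.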
normalization.dat
NOTE: Salience does NOT ship with a normalization.dat by default.
If you create a normalization.dat, you can normalize multiple different themes into the same theme. This is useful if you want to roll related themes up into one. For example, you could normalize "poor sound", "great sound", and "good quality speakers" into "audio quality".
To enable theme normalization, create a normalization.dat under /data/user/themes with each entry in the format:
- [theme][normalized_form]
NOTE: theme can be either the unstemmed or the stemmed form.
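As a rough illustration of the roll-up described above, the following Python sketch maps several themes onto one normalized form and counts the results. It is not Salience code. The helper names are hypothetical, and the sketch assumes that the two fields of each entry are separated by a tab and that themes are matched exactly; neither detail is confirmed here.

```python
from collections import Counter


def load_normalization(lines, sep="\t"):
    """Build a theme -> normalized form mapping (hypothetical helper).

    Assumption: each entry holds the theme and its normalized form
    separated by a tab; the real file layout may differ.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed entries
        theme, normalized = line.split(sep, 1)
        mapping[theme.strip()] = normalized.strip()
    return mapping


def roll_up(themes, mapping):
    """Count themes after replacing each one with its normalized form."""
    # Themes without an entry are kept unchanged.
    return Counter(mapping.get(theme, theme) for theme in themes)


entries = [
    "poor sound\taudio quality",
    "great sound\taudio quality",
    "good quality speakers\taudio quality",
]
mapping = load_normalization(entries)
counts = roll_up(
    ["poor sound", "great sound", "battery life", "good quality speakers"],
    mapping,
)
print(counts)
```

In this sketch the three sound-related themes collapse into a single "audio quality" count, while "battery life" has no entry and passes through unchanged.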