Preprocess Files

/data/salience/preprocess.dat enables you to replace text in your document with different text before it is tokenized. This is useful if you are trying to detect long entities (you can replace them with a different token), clean up HTML, and so on.

NOTE: When adding this file to a data directory on Linux, please ensure that the filename is preprocess.dat (all lower-case).

The format of the file is:


It is important to note that the replacements are carried out in order. If you have replacements for both 'the swimming team' and 'the swimming teams', ensure that 'the swimming teams' comes before 'the swimming team'; otherwise the shorter phrase will already have been replaced by the time the longer one is tried. You cannot use multiple preprocess files: all preprocess text must share a single file.
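
The effect of ordering can be illustrated with a short Python sketch. This is not Salience code and does not reproduce the preprocess.dat file format; it only assumes that each replacement is a literal, case-sensitive search-and-replace applied in sequence. The function name apply_replacements and the tokens TEAM_SINGULAR and TEAM_PLURAL are hypothetical, chosen for the example.

```python
def apply_replacements(text, replacements):
    """Apply (search, replace) pairs to text in the order given.

    Assumption: plain literal matching, used here only to show
    why the order of the entries matters.
    """
    for search, replace in replacements:
        text = text.replace(search, replace)
    return text


text = "We cheered for the swimming teams."

# Shorter phrase first: it matches inside the longer phrase,
# so the second entry never finds anything to replace.
wrong_order = [
    ("the swimming team", "TEAM_SINGULAR"),
    ("the swimming teams", "TEAM_PLURAL"),
]

# Longer phrase first: it is replaced before the shorter one is tried.
right_order = [
    ("the swimming teams", "TEAM_PLURAL"),
    ("the swimming team", "TEAM_SINGULAR"),
]

print(apply_replacements(text, wrong_order))  # We cheered for TEAM_SINGULARs.
print(apply_replacements(text, right_order))  # We cheered for TEAM_PLURAL.
```

With the shorter phrase listed first, the plural form is left with a stray trailing "s" and the entry for the longer phrase never matches, which is why the longer phrase should appear earlier in the file.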