It is just a fact of life that the only way to know the correct spelling of a word is to look it up in a dictionary. For example, there is no way to guess that "iPhone" should be spelled with a lower-case "i" and an upper-case "P": you need to be explicitly told that this is the correct spelling. Salience comes with just such a comprehensive dictionary of ordinary words, abbreviations, and proper nouns (names of people, companies, places, etc.) that is used to correct OCR errors. Nevertheless, as new words enter the language ("twerking") or new products or companies appear ("Bitcoin"), customers may find that they need to add new entries to the Salience OCR correction dictionaries. This is done through auxiliary dictionaries stored as .aux files in the data/user/correction directory. The following gives guidelines about how to create and organize .aux files.
A customer-defined auxiliary dictionary is defined in a file with the file extension ".aux" and stored in the data/user/correction directory. An .aux file has the following file format:
word1<tab>rank1
word2<tab>rank2
word3<tab>rank3
...
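As a rough illustration of the format above, the following Python sketch splits one entry into its word and rank. It assumes one entry per line and an integer rank. The function name `parse_aux_line` is hypothetical and is not part of Salience.

```python
def parse_aux_line(line):
    """Split one 'word<TAB>rank' entry into a (word, rank) pair.

    Assumption: each entry sits on its own line and the rank is an integer.
    Splitting on the last tab keeps multiword entries intact.
    """
    word, sep, rank_text = line.rstrip("\n").rpartition("\t")
    if not sep or not word:
        raise ValueError(f"expected 'word<TAB>rank', got {line!r}")
    return word, int(rank_text)  # int() raises ValueError on a non-numeric rank
```

A check like this can be run over a draft .aux file before copying it into data/user/correction, so that a missing tab or a non-numeric rank is caught early.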
Rank reflects the frequency of occurrence of words. In English, the word "the" has a rank of 1, since it is the most frequently occurring word. It is the only English word with a rank of 1. The words "of", "and", "to", "in" and "a" are the only English words with a rank of 2. The list below shows a sampling of words at their respective ranks.
1. the
2. of, and, to, in, a
3. was, by, not, I, or, ...
4. all, can, been, may, them, ...
5. many, through, same, even, ...
6. give, others, political, side, best, business, ...
7. factors, temperature, understanding, spirit, region, ...
8. wave, parallel, code, possession, drawing, experienced, ...
9. gardens, enhanced, continuity, recommend, delayed, reproduction, ...
10. cartilage, expects, waving, baptized, partisan, fortified, disclosed, socioeconomic, ...
11. banishment, intractable, indicted, indiscriminately, breakers, blunder, mediocre, ...
12. microwaves, vitriolic, deactivation, bedecked, constabulary, vocalist, teething, ...
13. listlessness, sleazy, asters, deportees, rhapsody, camcorder, nonacademic, ...
14. gibbering, transcendentalist, spellbinding, hobbyist, pocketknife, hooligan, ...
15. quadratics, pedicab, infiltrator, yuck, carafes, muscatel, lethargically, ovular, ...
16. schlepping, mummify, zaniness, counteroffensive, simpers, musclebound, ...
17. nonchargeable, creampuff, electrologist, outdrawn, pluckier, millipede, ...
18. brownnose, splayfoot, groundcloth, uneasiest, iodize, mudflaps, bellyflop, ...
19. bionically, choppiest, dumfounds, stargazed, schmaltzy, ...
The reason for belaboring the notion of ranks is that you need to pick a rank for every word you add to your auxiliary dictionary so that the OCR error correction behaves reasonably. The good news is that you don't need to be precise about the rank. In general, avoid assigning ranks from 1 to 6, since this is the domain of English function words. A rank between 9 and 16 is a sweet spot for most words, with 12 being a good default rank. If you think the word is too specialized (e.g., "titration") or a passing fad (e.g., "twerking"), feel free to bump it down slightly in the rankings, perhaps giving it a rank of 14 rather than 12. (Just as in life, higher numbers mean lower rankings: 3rd place in a contest is less important than 1st place.) If you think the word has legs and will become more commonly used (e.g., "tweet"), you can bump it up in the rankings slightly, perhaps giving it a 10 rather than the default 12. Don't be overly concerned about the choice of rank: any sensible value will give you good OCR error correction.
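The rule of thumb above can be written down as a small helper. This is only a sketch of the guidance in this section, not a Salience API. The names `suggest_rank` and `in_function_word_range` are hypothetical, and the step of 2 simply mirrors the 14 and 10 examples.

```python
DEFAULT_RANK = 12  # the default suggested in this section


def suggest_rank(specialized_or_fad=False, likely_to_spread=False):
    """Return a starting rank following the rule of thumb in the text."""
    rank = DEFAULT_RANK
    if specialized_or_fad:
        rank += 2  # higher number = rarer word, e.g. 14 for "titration"
    if likely_to_spread:
        rank -= 2  # lower number = more common word, e.g. 10 for "tweet"
    return rank


def in_function_word_range(rank):
    """True for ranks 1 to 6, which the text advises against using."""
    return 1 <= rank <= 6
```

Since any sensible value works, treat the result as a starting point rather than a precise measurement.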
When adding a new word to the auxiliary dictionary, you should add all inflected forms of the word. For example, to add the word "twerk", include all the following entries:
twerk<tab>14
twerks<tab>14
twerking<tab>14
twerked<tab>14
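If you keep your word lists in a script, entries like the ones above can be generated rather than typed by hand. The sketch below assumes one entry per line and expects you to supply the inflected forms yourself, since English inflection is too irregular to derive reliably. The function name `format_entries` is hypothetical.

```python
def format_entries(forms, rank):
    """Render each form as a 'word<TAB>rank' line, all sharing one rank."""
    return "".join(f"{form}\t{rank}\n" for form in forms)


# Example: the four forms of "twerk", all at rank 14.
twerk_entries = format_entries(["twerk", "twerks", "twerking", "twerked"], 14)
```

The resulting text can then be saved to a file with the ".aux" extension (for example "slang.aux") in the data/user/correction directory.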
You may find it useful to maintain multiple dictionaries for different types of words. For example, the entries for "twerking" could be included in an auxiliary dictionary named "vocabulary.aux" or "slang.aux". (The name of the dictionary doesn't matter, as long as its extension is ".aux".) You might also choose to maintain separate auxiliary files for abbreviations, people, companies and organizations, places, and domain-specific terminology and abbreviations. Here are some examples of different types of words and some good default ranks for each.
Sample dictionary entry
domain-specific abbreviation (e.g., financial)
More commonly, you will be adding names (persons, places, organizations, etc.) rather than vocabulary to your auxiliary dictionaries. There is a little bit of an art to this. If it is a multiword name, generally you should add only the uncommon words in the name to the dictionary. For example, if the name is "Wintrust Financial", you would add the following entries to the dictionary:
Wintrust Financial<tab>8
Wintrust<tab>8
Note that the word "Financial" has not been added as a separate entry, since it is a common word included in the base dictionary. Also note that, unlike ordinary vocabulary words, the entries reflect the expected capitalization for the names. Finally, note that the entire name is also included as an entry along with its constituent uncommon words.
Sometimes it is appropriate to enter only the entire name in the dictionary, even if it contains some uncommon words. The most usual case for this is with names from other languages. For example, the name "Isle au Haut" would have only the following entry in the dictionary:
Isle au Haut<tab>8
Why not add "au" or "Haut" also? Because these are very unlikely words (in English, at least), and introducing them into the vocabulary runs the risk of creating misguided OCR error correction. For example, it might cause "ao" to be corrected to "au" rather than the more likely "an", or "Hau1" to be corrected to "Haut" rather than the more likely "Haul". These situations are rare, however, and you can expect adding names and vocabulary to auxiliary dictionaries to be generally straightforward.
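The name guidelines above can be sketched as a short helper. It assumes you supply your own set of lower-cased common words, because the Salience base dictionary is not queried here. The function name `name_entries` and its parameters are hypothetical, and the default rank of 8 simply follows the sample entries.

```python
def name_entries(name, common_words, rank=8, whole_name_only=False):
    """List (entry, rank) pairs for a name, following the guidelines above.

    common_words: caller-supplied set of lower-cased common words
    (a stand-in for the base dictionary, which this sketch cannot see).
    whole_name_only: set to True for names such as "Isle au Haut", whose
    parts would risk misguided corrections if added on their own.
    """
    entries = [(name, rank)]  # the entire name is always included
    if not whole_name_only:
        for word in name.split():
            # Add only uncommon parts, keeping their capitalization.
            if word.lower() not in common_words and word != name:
                entries.append((word, rank))
    return entries
```

For "Wintrust Financial" with "financial" in the common-word set, this yields the full name plus "Wintrust". With `whole_name_only=True`, "Isle au Haut" yields a single entry.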