Like English, German is a Germanic language within the Indo-European family. As such, many of the same tokenization and core NLP approaches apply to both languages. However, as with any language, German has certain unique aspects that must be considered to provide a true natural language approach.
Support for German is distributed as a separate data directory download from the Customer Support portal.
Release notes and versions can be found on the Customer Support portal.
One of the main differences that one notices about German is that more words are capitalized than in English, because German capitalizes all nouns rather than only proper nouns. This creates a problem when attempting to identify proper nouns, as capitalization is a major clue for distinguishing proper nouns in English (and many other languages). Consider the following example:
Die bisherigen Überlegungen sehen vor, ein Museum zum Kalten Krieg in einem Bürokomplex entstehen zu lassen, der auf einer noch unbebauten Fläche am Checkpoint Charlie entstehen soll.
Among other capitalized words, Bürokomplex might refer to a specific office complex, but it is not a specifically named place and thus not a proper noun or named entity.
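To make the problem concrete, the following is a minimal sketch of a naive capitalization heuristic applied to the example sentence. It is an illustration only and does not reflect how Salience works internally; the function name `capitalized_candidates` is hypothetical.

```python
import re

# The example sentence from the text above.
SENTENCE = (
    "Die bisherigen Überlegungen sehen vor, ein Museum zum Kalten Krieg in einem "
    "Bürokomplex entstehen zu lassen, der auf einer noch unbebauten Fläche am "
    "Checkpoint Charlie entstehen soll."
)


def capitalized_candidates(text):
    """Naive heuristic (hypothetical helper): treat every capitalized token,
    other than the sentence-initial word, as a proper-noun candidate."""
    tokens = re.findall(r"\w+", text)  # \w is Unicode-aware for str patterns
    return [tok for tok in tokens[1:] if tok[0].isupper()]


candidates = capitalized_candidates(SENTENCE)
print(candidates)
```

With this heuristic, ordinary nouns such as Überlegungen, Museum, Bürokomplex, and Fläche are flagged alongside Checkpoint and Charlie, so capitalization by itself cannot separate the named place from the common nouns.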
The approach to addressing this aspect of German is not unique to the language. Throughout our support for multiple languages, rather than relying on massive dictionaries of all the possible common nouns, verbs, and other parts of speech, Salience uses a model-based approach in which a large corpus of words is annotated with parts of speech in context. The result, in German, is that the part-of-speech tagger is more conservative in its tagging of proper nouns on the basis of capitalization alone.