The Chinese language differs greatly from the Western languages we currently support in Salience. The writing system is entirely different and is arguably the most complicated writing system in the world. Thus, we cannot approach Chinese the same way we approach English; we have to approach it from the ground up, and this is precisely what Salience does. In this article, we will explore the characteristics of the Chinese language that distinguish it from our other languages, both linguistically and in terms of implementation.
Support for Chinese is distributed as a separate data directory download from the Customer Support portal.
Release notes and versions can be found on the Customer Support portal.
Most Western languages use characters that represent phonetic components to compose words that result in a sound that has meaning. The individual characters themselves have no inherent meaning.
In Chinese, each character is an ideogram that represents meaning and has a corresponding sound associated with it. Although certain character radicals may clue you in to the pronunciation, there is no way to know how to pronounce a character in Chinese without having previously memorized it. So when you form words in Chinese, you are combining different meaningful components that result in another meaning. The characters themselves do have inherent meaning.
Look at how the word "puppy" is composed in English versus Chinese. A teal-colored circle represents a meaningful component, whereas a grayish circle just represents sound and no meaning.
Not all words in Chinese are as literal or straightforward as the example above, but the basis for composition is essentially the same. Even if a word is composed of characters that at first glance don't seem to make sense, there may have been a time when they did, and there is usually some way to justify each component of a word.
Look at how the character 联, meaning "to join" or "to unite", can contribute to the meanings of different words in the diagram below:
You may be getting a sense of how different Chinese is from many other languages and how important it is to account for these differences.
There are hundreds, possibly thousands, of languages and dialects spoken in China today. Some are mutually intelligible, others are not. In order to maintain consistency, the Chinese language pack for Salience is built for the official dialect of China as named by the Chinese government: Mandarin. This is also the most widely spoken dialect of Chinese. Note that saying "Chinese" is ambiguous because of all the different dialects that exist. When we refer to Chinese we will be referring to Mandarin, unless otherwise specified.
Chinese characters can be written in two forms, traditional or simplified. Traditional, as the name suggests, is an older character set that is officially used in places like Taiwan and Hong Kong. It also tends to be associated with Classical Chinese studies, since it is closer to how the language was actually written at the time. Simplified characters, also as the name suggests, are simpler to write. They were created in the hope of improving literacy and were made the official character set of mainland China by the government. That said, many characters are written exactly the same in traditional and simplified. Below is an example of the same sentence written in traditional and simplified:
English: After he graduates from college, he would like to study abroad in China.
Since the mapping is one-to-one, each character is aligned to its respective character in simplified or traditional. The characters that differ are in orange; in this sentence, that is only about half of them. We can see how the simplified characters have fewer strokes and are therefore less complicated to write than their traditional counterparts.
Internally, Salience software uses simplified characters to analyze text. However, since the mapping between traditional and simplified characters (in Mandarin) is essentially one-to-one, we support both traditional and simplified character sets. The user can specify which character set they would like to use. If you are unsure which is which, Salience will be able to figure it out.
According to reports, Microsoft hopes that with the help of Krikorian’s successful past experiences, they can help the Xbox team to better operate interactive gaming and home entertainment services.
What is one of the first things you notice about the Chinese sentence above? Probably the lack of spaces between words in Chinese. If there are no spaces in Chinese, how do we distinguish words? In English, we can say that word boundaries are determined by spaces. But how do we determine where the spaces go? A more universal definition of a word might be "a complete semantic unit that can exist independently". This definition seems sufficient when we have spaces to fall back on. But what if we don't? In certain cases we may be able to justify different segmentations.
We can imagine a confusing, parallel scenario in English with certain compound nouns. Is it "snowman" or "snow man", and does it even really matter? We usually don't think too much about this; if we see "snow man" we are content to say it is two words, and if it's "snowman" we'll say it's one word. In either case, we get the point. With "thesnowmanismelting", on the other hand, most of us would be equally dissatisfied and would demand some segmentation to get "the snowman is melting". The point is that in Chinese you get a lot of "snowman" vs. "snow man" scenarios, and we simply have to choose. We don't have space conventions to fall back on as we do in many Western languages. What we do have are general guidelines established by researchers who have created large-scale corpora for Chinese and have already made these types of decisions. We should emphasize, though, that these are just guidelines and not necessarily the only options. Our segmentation standard mainly follows the one defined by the Chinese Treebank. With that said, the segmentation that Salience arrives at for the original sentence is displayed below with an English gloss.
|according to reports||,||Microsoft||hopes||with the help of||Krikorian||past||functional character (possessive)||successful||experiences|
|,||come||help||Xbox||team||more||good||functional character (adverbial)||operate||Interactive|
As a side note, looking at this segmentation and the corresponding English makes it easier to see how important it is to analyze Chinese from the ground up, as Salience does. The literal translation above is difficult even for a human to understand, let alone a translation engine. Not even a state-of-the-art translation engine, like Google Translate, can translate this sentence accurately. Google translates this sentence as follows:
It is reported that Microsoft is hoping Kerry Collins the past successful experience, the Xbox team to help better operate interactive gaming and home entertainment business.
Compare this with the human translation given at the beginning of this section and you will see some of the obvious and subtle inaccuracies. Our Chinese language pack uses Chinese rules and data to analyze Chinese text. It does not funnel Chinese (or any language, for that matter) through a translation engine, because even the state of the art in machine translation is simply not good enough for such detailed analysis, as we saw with this example.
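To make the segmentation problem concrete, here is a minimal Python sketch of greedy longest-match segmentation against a word list, applied to the "thesnowmanismelting" example from earlier in this section. It is a classic baseline and only an illustration; it is not the method Salience uses, and `segment` and `LEXICON` are hypothetical names.

```python
def segment(text, lexicon):
    """Greedy forward longest-match segmentation against a word list."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# The choice between "snowman" and "snow man" depends entirely on the lexicon.
LEXICON = {"the", "snow", "man", "snowman", "is", "melting"}
print(segment("thesnowmanismelting", LEXICON))
# ['the', 'snowman', 'is', 'melting']
```

Removing "snowman" from the word list yields "snow" and "man" instead, which shows how a segmentation standard amounts to a set of decisions about what counts as a word.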
Salience will segment character strings into words, and much of the analysis is done based on these word segments. However, to offer a different perspective on the data, and one that is unique to Chinese, we display n-gram frequencies for Chinese at the character level as opposed to the word level. Given the fuzzy nature of wordhood in Chinese, as well as the fact that characters have meaning independent of the word they are a part of, it made more sense to display it this way. Therefore we show frequencies for 1-grams, 2-grams, 3-grams, and 4-grams. Why up to 4? Most words in Chinese, especially frequently used words, are fairly short. Four-character words tend to be 4-character idioms, called chengyus, which are discussed in more detail in the Chengyu section.
The part-of-speech tags we use for Chinese are based on those of the Chinese Treebank. We mapped many of the Chinese Treebank tags to equivalent tags in our current POS tagset, which is based on the Penn Treebank for English. However, we retained the POS tags that are unique to Chinese and added them to our overall POS tagset. The tags are shown in the chart below; tags unique to Chinese are italicized. For a full set of POS tags identified by Salience, please see the page [Supported part-of-speech tags].
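Character-level n-gram counting as described above can be sketched in a few lines of Python. This is an illustration only: `char_ngram_counts` is a hypothetical helper, and it does nothing special with punctuation, which a real implementation would probably treat as a boundary.

```python
from collections import Counter


def char_ngram_counts(text, max_n=4):
    """Count character n-grams for n = 1..max_n, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    counts = {}
    for n in range(1, max_n + 1):
        counts[n] = Counter(
            "".join(chars[i:i + n]) for i in range(len(chars) - n + 1)
        )
    return counts


counts = char_ngram_counts("经济发展经济")
print(counts[2].most_common(1))  # [('经济', 2)]
```

Because the counts are taken over characters rather than segmented words, they do not depend on any particular segmentation decision.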
|Tag||Description||Example|
|BA||Ba-construction, unique to Chinese||把|
|DEC||的 in a relative clause||的|
|DER||得 in V-得 constructions||得|
|ETC||等, meaning "etc."||等, 等等|
|IN||Preposition||在 in, on, at|
|JJV||Predicative adjective||很快 fast|
|LB||被 in passive construction||被|
|M||Measure word||个 most common measure word|
|NN||Common noun||经济 economy|
|NNP||Proper noun||中国 China|
|RB||Adverb||很快 very quickly|
|VB||Verb||进行 to implement|
Many of the unique tags in Chinese refer to characters that are essential to grammatical structures in Chinese but have no equivalent meaning in English; these tags include AS, BA, DEC, DEG, DER, DEV, LB, M, MSP, and SP. JJV is equivalent to the VA tag in the Chinese Treebank and covers words that fall somewhere between verbs and adjectives.
Parts of speech in Chinese are also fuzzy because we don't have the same surface indicators that we have in languages like English. You might notice that there is no NNS tag (plural noun); this is because there is no explicit indication of number on the noun itself in Chinese. Chinese does not inflect for tense, number, person, etc., so the surface forms stay the same even though the meaning may change slightly. POS tagging in Chinese is more difficult because the same word can be a completely different part of speech depending on the context. For example, 发展 can mean "development" and therefore be a noun in one context, or it can mean "to develop" and be a verb in another. This distinction is by no means ambiguous to a speaker of the language, but it may be an unsettling idea to those who speak languages that inflect for such distinctions.
In the [n-gram section], we mentioned that we display up to 4-grams in Salience. The existence of chengyus is one of the reasons for this. Chengyus are 4-character idioms that are a significant and unique part of the Chinese language. There are thousands of them, and they can convey all sorts of thoughts or scenarios succinctly and poetically. They are also very commonly used; you are bound to find at least one chengyu in any given article in Chinese. The following is an example of a chengyu with a gloss for each of the characters:
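The tag mapping described in this section can be pictured as a simple lookup, sketched below in Python. Only the VA to JJV pair and the list of retained Chinese-only tags come from this article; the other pairs are assumptions chosen for illustration, and `CTB_TO_PENN_STYLE`, `CHINESE_ONLY`, and `map_tag` are hypothetical names rather than part of Salience.

```python
# Chinese Treebank tag -> tag in a Penn-Treebank-style tagset.
# Only VA -> JJV is stated in the article; the other pairs are assumed.
CTB_TO_PENN_STYLE = {
    "VA": "JJV",  # predicative adjective
    "NR": "NNP",  # proper noun (assumed)
    "AD": "RB",   # adverb (assumed)
    "VV": "VB",   # verb (assumed)
    "P": "IN",    # preposition (assumed)
}

# Tags unique to Chinese that are kept unchanged, as listed in the article.
CHINESE_ONLY = {"AS", "BA", "DEC", "DEG", "DER", "DEV", "LB", "M", "MSP", "SP"}


def map_tag(ctb_tag):
    """Map a Chinese Treebank tag, keeping Chinese-only tags unchanged."""
    if ctb_tag in CHINESE_ONLY:
        return ctb_tag
    # Tags without an entry pass through as-is in this sketch.
    return CTB_TO_PENN_STYLE.get(ctb_tag, ctb_tag)


# The same surface form can carry different tags depending on context.
print(map_tag("NN"), map_tag("VV"))  # NN VB
```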
|root||end||upside down||to place|
Literally, this chengyu means something like "to invert the root and the tip", and in practice you'd use it to mean something similar to "mixing up cause and effect".
Salience includes a dictionary of several thousand chengyus and uses them to ensure proper segmentation of chengyus as a single unit.
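One way to picture dictionary-protected segmentation is a pre-pass that finds known chengyus and sets them aside as single units before the rest of the text is segmented. The Python sketch below is a simplification under that assumption, not Salience's implementation; `CHENGYU` and `split_on_chengyu` are hypothetical names, and the one sample entry, 本末倒置, is the idiom matching the gloss above.

```python
# Sample dictionary with a single entry; a real one would hold thousands.
CHENGYU = {"本末倒置"}


def split_on_chengyu(text, chengyu=CHENGYU):
    """Return (chunk, is_chengyu) pairs, keeping each chengyu whole."""
    pieces = []
    buffer = []
    i = 0
    while i < len(text):
        window = text[i:i + 4]
        if window in chengyu:
            # Flush any ordinary text collected before the idiom.
            if buffer:
                pieces.append(("".join(buffer), False))
                buffer = []
            pieces.append((window, True))
            i += 4
        else:
            buffer.append(text[i])
            i += 1
    if buffer:
        pieces.append(("".join(buffer), False))
    return pieces
```

Chunks marked `False` would then go through ordinary word segmentation, while chunks marked `True` are already complete words.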
We touched upon the fact that Chinese has very little morphological inflection compared to English. Nouns don't change form to represent the plural, and verbs don't change form to represent tense or person. In general, words in Chinese rarely change their surface form. Instead, context and function words carry much of the grammatical information that English conveys through inflections on verbs and nouns. Because of this, the concept of stemming is largely moot in Chinese. To emphasize this point, look at how the noun stays the same in Chinese while the English form changes.
It is clear from context that we are referring to more than one person in the second example; however, the word for "people/person" stays exactly the same.
Similarly, look how the verb in Chinese stays the same in the following examples:
|He already ate.|
|They are eating tonight at 8 o'clock.|
The first sentence refers to a past scenario and the second refers to a future scenario. This is accounted for in English with changes to the form of the verb, but the verb in Chinese stays the same.
Salience recognizes one stemming scenario in Chinese, and it has to do with dialectal differences. The Beijing dialect, when spoken, has an "er" sound attached to the ends of many words. This is represented in writing by the character 儿. In this context the character carries no meaning; it is purely phonetic. Salience will stem words that end in this phonetic character, since the meaning is exactly the same.
An example of a word with and without the "er" sound is shown below.
Both of these mean "West Gate".
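The "er" stemming rule can be sketched in Python as follows. This is an illustration, not Salience's implementation: `stem` and `KEEP_AS_IS` are hypothetical, and the example assumes the two "West Gate" forms are 西门 and 西门儿. The small exception list is an added assumption, included because 儿 is not purely phonetic in every word that ends with it (for example 女儿, "daughter").

```python
ERHUA = "儿"

# Hypothetical exception list: words in which the final 儿 carries meaning.
KEEP_AS_IS = {"女儿", "婴儿", "幼儿"}


def stem(word):
    """Strip a trailing phonetic 儿 from a word, leaving exceptions alone."""
    if len(word) > 1 and word.endswith(ERHUA) and word not in KEEP_AS_IS:
        return word[:-1]
    return word


print(stem("西门儿"))  # 西门
```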