Sample relationship pattern

Here is a line from the Occupation relationship that ships with Salience:

{[email protected]} (='is')(=',') (DT)? *? (?<position> (#~Occupation-Prefixes.dat)? (#~Occupations.dat) ((='of') (NN)(NNS)(NNP)(NNPS)(JJ) (CC)? (NN)(NNS)(NNP)(NNPS)(JJ)? (NN)(NNS)(NNP)(NNPS)?)? ) ( ((=',') (#~Occupation-Prefixes.dat)? (#~Occupations.dat))* (=',')? (='and') (#~Occupation-Prefixes.dat)? (#~Occupations.dat) )? (='of')(='at')(='for')(='in') *? *? {[email protected]}

Note that patterns do not need to be this complex. Many simple patterns can also match sentences effectively. Now, let's break the pattern up into smaller pieces to see what's going on:

{[email protected]}

A person has an occupation at a company. The first construct in this pattern is {[email protected]}, which means the pattern must begin with a Person entity. Note that this doesn't mean that every sentence that matches needs to start with a Person. A pattern will often match only part of a sentence. If we wanted to require the person name to start the sentence, we would write ^ {[email protected]}. ^ means start of a sentence.

(='is')(=',')

The word after the person's name must be "is" or a comma. Here we are trying to match sentences like "Bob is CEO" or "Bob, CEO, …". Because we didn't put spaces between the two tokens, the pattern will match a sentence that uses either "is" or a comma.

(DT)? *?

DT stands for determiner. These are words like "a", "the", "some". We put a question mark after it because "Bob is CEO" and "Bob is the CEO" are both valid. The next symbol, *?, says any other word can appear afterwards. *? lets us account for the extra words we add to sentences. For example, "Bob is a great CEO" or "Bob is my CEO" will still match because of the *? symbol. "Bob is the single greatest and most friendly, bar none, of all the CEOs that I know" will not match. If we had said ** that sentence would have matched, but we would risk matching incorrect sentences like "Bob is a poor worker, but he's good friends with the CEO". Wildcards are helpful for matching flowery language, but can reduce the precision of a pattern. The balance between precision and recall with wildcards is a tricky part of building relationships, and part of the reason testing your relationships is so important.
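To make the precision and recall trade-off concrete, here is a rough analogy written with ordinary Python regular expressions rather than Salience's pattern syntax. It assumes the single wildcard behaves like one optional extra word. The names DET, ONE_WORD and ANY_WORDS, the short determiner list, and the hard-coded "CEO" are all made up for this sketch.

```python
import re

# Illustrative stand-ins only; these are not Salience constructs.
DET = r"(?:(?:a|an|the|some)\s+)?"   # optional determiner, in the spirit of (DT)?
ONE_WORD = r"(?:\w+\s+)?"            # at most one extra word (bounded wildcard)
ANY_WORDS = r"(?:[\w',]+\s+)*"       # any number of extra words (unbounded wildcard)

bounded = re.compile(r"\bis\s+" + DET + ONE_WORD + r"CEOs?\b")
unbounded = re.compile(r"\bis\s+" + DET + ANY_WORDS + r"CEOs?\b")

sentences = [
    "Bob is CEO",
    "Bob is a great CEO",
    "Bob is my CEO",
    "Bob is the single greatest and most friendly, bar none, of all the CEOs that I know",
    "Bob is a poor worker, but he's good friends with the CEO",
]

for s in sentences:
    # Show which variant accepts each sentence.
    print(bool(bounded.search(s)), bool(unbounded.search(s)), s)
```

The bounded variant accepts the first three sentences and rejects the last two. The unbounded variant accepts all five, including the last one, which does not say that Bob is the CEO.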

(?<position>

Here a capture begins. The text that follows captures what exactly this person is. Since a job is not currently set up as an entity, we match it with a capture instead of an entity.

(#~Occupation-Prefixes.dat)? (#~Occupations.dat)

Occupation-Prefixes.dat contains words like Sr. and Assistant that could be added to most job titles. The ~ means we want to match the dictionary case-insensitively. The ? at the end means a prefix isn't necessary. Occupations.dat contains a list of many different occupations a person could have. This part is required, and again is case-insensitive.
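As a minimal sketch of the idea (not of how Salience loads or matches .dat files), the optional prefix followed by a required occupation can be imitated in Python with two sets. The set contents and the match_title helper are hypothetical and exist only for this illustration.

```python
# Hypothetical stand-ins for Occupation-Prefixes.dat and Occupations.dat.
PREFIXES = {"sr.", "assistant"}
OCCUPATIONS = {"ceo", "director", "manager"}


def match_title(tokens, start=0):
    """Return the index just past a title beginning at `start`, or None."""
    candidates = []
    if start < len(tokens) and tokens[start].lower() in PREFIXES:
        candidates.append(start + 1)  # occupation expected after a prefix
    candidates.append(start)          # or the occupation with no prefix
    for i in candidates:
        # Lower-casing the token gives the case-insensitive dictionary lookup.
        if i < len(tokens) and tokens[i].lower() in OCCUPATIONS:
            return i + 1
    return None


print(match_title(["Assistant", "Manager", "at", "Acme"]))
print(match_title(["Assistant", "at", "Acme"]))
```

A prefix on its own does not count as a title, which mirrors the occupation dictionary being the required part.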

((='of') (NN)(NNS)(NNP)(NNPS)(JJ) (CC)? (NN)(NNS)(NNP)(NNPS)(JJ)? (NN)(NNS)(NNP)(NNPS)?)?

People often have job titles that reference their responsibility or a branch of the company, such as Vice President of Product Management, or Director of Public Relations. We could list all these possible responsibilities, but because there are so many it's helpful to match them grammatically. First, note that this clause is in parentheses and has a ? outside it. That means a sentence can match it, but doesn't have to. If it does match it, it must start with the word "of". The next word can be a noun (NN), plural noun (NNS), proper noun (NNP), plural proper noun (NNPS) or adjective (JJ). This first word is not optional (except in terms of the whole clause being optional). We don't want to match "Vice President of".

After the first word of the title, we can optionally have a conjunction (CC). This lets us match people with multiple responsibilities, like "Vice President of Sales and Marketing".

Finally, we optionally repeat the noun or adjective clause to match longer titles. The final clause does not accept an adjective, as this is a prepositional phrase and therefore must end in a noun.
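The grammatical matching described above can be sketched in Python over tokens that already carry part-of-speech tags. This is only an illustration of the logic: the tags in the example are assigned by hand, and match_of_clause is a hypothetical helper, not part of Salience.

```python
NOUN = {"NN", "NNS", "NNP", "NNPS"}
NOUN_OR_ADJ = NOUN | {"JJ"}


def match_of_clause(tagged, start):
    """Return the index just past an optional 'of ...' clause at `start`.

    `tagged` is a list of (word, tag) pairs. If the clause is absent or
    incomplete, `start` is returned unchanged, since the clause is optional.
    """
    i = start
    if i >= len(tagged) or tagged[i][0].lower() != "of":
        return start
    i += 1
    # The first word after "of" is required: a noun or an adjective.
    if i >= len(tagged) or tagged[i][1] not in NOUN_OR_ADJ:
        return start  # do not accept a dangling "Vice President of"
    i += 1
    if i < len(tagged) and tagged[i][1] == "CC":         # optional conjunction
        i += 1
    if i < len(tagged) and tagged[i][1] in NOUN_OR_ADJ:  # optional noun/adjective
        i += 1
    if i < len(tagged) and tagged[i][1] in NOUN:         # optional final noun
        i += 1
    return i


# Hand-tagged example; a real tagger may assign different tags.
title = [("Vice", "NNP"), ("President", "NNP"), ("of", "IN"),
         ("Sales", "NNS"), ("and", "CC"), ("Marketing", "NN"),
         ("at", "IN"), ("Acme", "NNP")]
end = match_of_clause(title, 2)
print(" ".join(word for word, _ in title[:end]))
```

Because every element after the first noun or adjective is optional, a simple left-to-right scan is enough here. A full pattern engine would also need backtracking.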

)

Finally, we close the capture clause. Any text after this won't be considered part of the job title. Again we've made a trade-off: being precise about what makes a job title means we will miss very long titles. As an alternative, we could have built a new entity type that matches occupations instead of finding it as a capture in the relationship.

( ((=',') (#~Occupation-Prefixes.dat)? (#~Occupations.dat))*

This optional clause will match additional job titles. Many executives have multiple positions in a company, like "President, Chairman and CEO". Here we have a decision. By including this clause within the capture, we can recognize the string of titles as one occupation. Or, as we did here, we can recognize a relationship multiple times, saying this person is President, is Chairman, and is CEO. Doing it this way will require us to write additional patterns that capture this part of the relationship. The above line recognizes any number of job titles separated by commas.

(=',')? (='and') (#~Occupation-Prefixes.dat)? (#~Occupations.dat) )?

The final job title must be separated by "and" or ", and". The star in the previous clause meant that we could match the comma-separated jobs zero times, but notice that there is no * or ? immediately after (#~Occupations.dat). If there are multiple job titles, we must end on "and {Title}". There is a ? after the entire clause, though. This makes the list feature of the job pattern optional, so we can still match an individual job. ** would also have allowed us to match a pattern even if the job title was followed by more occupations, but would also have matched other patterns. By specifying precisely what is allowed to follow a job title, we improve our results.
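The shape of this list clause (zero or more comma-separated titles, then a required "and {Title}", with the whole clause optional) can be imitated with a Python regular expression. This is an analogy only: the TITLE alternation is a made-up stand-in for the dictionaries, and titles_in is a hypothetical helper.

```python
import re

# Hypothetical stand-in for the occupation dictionaries.
TITLE = r"(?:President|Chairman|CEO|CFO)"

# First title, then optionally: any number of ", Title" items,
# an optional comma, the word "and", and one final title.
TITLE_LIST = re.compile(
    rf"{TITLE}(?:(?:,\s*{TITLE})*,?\s+and\s+{TITLE})?"
)


def titles_in(text):
    """Return the titles matched at the start of `text` as a list."""
    m = TITLE_LIST.match(text)
    return re.findall(TITLE, m.group(0)) if m else []


print(titles_in("President, Chairman and CEO of Acme"))
print(titles_in("President, Chairman, and CEO of Acme"))
print(titles_in("CEO of Acme"))
print(titles_in("President, Chairman of Acme"))
```

The last call returns only the first title: without a closing "and {Title}", the list clause as a whole fails and only the individual job is matched.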

(='of')(='at')(='for')(='in')

After the job title, we need a preposition telling us that this person works this job for this company. The above are the four prepositions commonly used to express this relationship.

*? *? {[email protected]}

Again we add a couple of wildcards to allow some variety in how the sentence is written. Depending on the source of your documents, you may see sentences that express ideas succinctly and don't need wildcards to catch extra words, or you may see adjective- and adverb-filled sentences that fail to fit into tight patterns. Again, note that just because the pattern ends, the sentence doesn't have to. If we needed the company to end the sentence, we could add a $ after it.

This pattern is one of 16 used to express the concept of occupation. In general, you will need at least as many patterns as there are ways to order the entities in your sentences. The above pattern did not get written in its full complexity in one pass. In general, you'll rewrite a pattern many times. Start simple, and see which relationships the pattern misses and which incorrect relationships it picks up. Then figure out the most general way of including or excluding those cases.
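One way to organize that loop is a small test harness that runs a candidate pattern over hand-labeled sentences and reports what it missed and what it wrongly picked up. The sketch below uses a Python regular expression as a stand-in for a Salience pattern. The evaluate helper and the sample sentences are assumptions for illustration.

```python
import re


def evaluate(pattern, labeled):
    """Score `pattern` on (sentence, should_match) pairs.

    Returns (precision, recall, misses, false_hits).
    """
    tp = fp = fn = 0
    misses, false_hits = [], []
    for sentence, expected in labeled:
        hit = pattern.search(sentence) is not None
        if hit and expected:
            tp += 1
        elif hit:
            fp += 1
            false_hits.append(sentence)
        elif expected:
            fn += 1
            misses.append(sentence)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, misses, false_hits


# Hypothetical labeled sentences and a deliberately simple first attempt.
labeled = [
    ("Bob is CEO of Acme", True),
    ("Bob is the CEO of Acme", True),
    ("Bob is friends with the CEO", False),
]
first_try = re.compile(r"\bis CEO\b")

precision, recall, misses, false_hits = evaluate(first_try, labeled)
print(precision, recall)
print("missed:", misses)
print("wrongly matched:", false_hits)
```

Here the first attempt is precise but misses "Bob is the CEO of Acme", which suggests the next revision: allow an optional determiner.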