Pattern Syntax

Overview

The Lexalytics Pattern Language is used to match textual and grammatical patterns in various places within the Lexalytics code base. The most notable uses are for entity recognition, entity relationships, and opinions. This document describes the complete pattern syntax, but the basic operators are:

(TAG) to match the part of speech of a token, where TAG is a part-of-speech tag such as NNP (see Part-of-speech Predicates below for more details)

(='text') to match the text of a word (case-sensitive)

(=~'text') to match the text of a word (case-insensitive)

(=~s~'text') to match the text of a word (stemmed)

(regex='text') to use a regular expression

Quantifiers can be added to adjust the match criteria:

? makes a predicate optional

+ matches 1 or more times

* matches 0 or more times

{n,m} matches between n and m times (inclusive)

To create a list of alternative predicates, any one of which may match, do not put a space between the predicates, e.g. (='this')(='that') to match 'this' or 'that'. Adding a space, as in (='this') (='that'), would match 'this' followed by 'that'.

More documentation is provided for the following pattern language functionality:

Pattern Syntax

This describes a partial grammar in EBNF for the patterns that Salience Engine accepts:

pattern := [ "^" ], [ { group }, "<" ], { group }, [ ">", { group } ], [ "$" ]

group := part, [ quantifier ]
       | "(", [ capture ], { group }, ")", [ quantifier ]

capture := "?<", identifier, ">"

part := [ "+" | "-" ], ( "*" | conjunct ), [ "/", number ]
      | phrase-dict-predicate

conjunct := [ "!" ], disjunct, [ "&", conjunct ]

disjunct := predicate, { predicate }

predicate := literal-predicate | hash-predicate | suffix-predicate | pos-predicate | regex-predicate

literal-predicate := "(=", [ "~" ], "'", literal, "')"

hash-predicate := "(#", [ "~" ], filename, ")"

suffix-predicate := "(%", filename, ")"

pos-predicate := "(", pos-tag, ")"

regex-predicate := "(regex=", ( regex | "'", regex, "'" ), ")"

phrase-dict-predicate := "[", file-list, "]"

file-list := [ "!" ], [ "~" ], fileglob, [ ",", file-list ]

quantifier := "?" | "*" | "+"

identifier := { letter | "_" }

number := non-zero-digit, { digit }

Each pattern part matches against exactly one token (unless it is marked with one of the repetition or option quantifiers -- ?, * or +). Note that pattern parts must be separated by whitespace, but there should be no whitespace between predicates within a pattern part. For example

(NN)(NNS)

matches a single token that is part-of-speech tagged as either a noun (NN) or a plural noun (NNS), while inserting whitespace, as in

(NN) (NNS)

would cause the pattern to match a two-token sequence of a noun followed by a plural noun.
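The difference between the two forms can be sketched with a toy model in Python. This is only an illustration of the semantics described above, not how Salience Engine is implemented; the function name and the example tags are assumptions made for the sketch.

```python
# Toy model: a pattern is a list of parts, and each part is the set of
# part-of-speech tags it accepts (predicates inside one part are ORed).
def matches_at(pattern, tags, start):
    """Return True if each part accepts the tag at the corresponding position."""
    if start + len(pattern) > len(tags):
        return False
    return all(tags[start + i] in part for i, part in enumerate(pattern))

one_part = [{"NN", "NNS"}]      # (NN)(NNS): one token, noun OR plural noun
two_parts = [{"NN"}, {"NNS"}]   # (NN) (NNS): a noun followed by a plural noun

tags = ["NN", "NNS"]            # hypothetical tags for a two-token phrase
print(matches_at(one_part, tags, 0), matches_at(one_part, tags, 1))
print(matches_at(two_parts, tags, 0), matches_at(two_parts, tags, 1))
```

The single-part pattern matches at either token, while the two-part pattern only matches the sequence starting at the first token.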

Literal Predicates

The simplest type of pattern predicate matches tokens against a literal string. For example,

(='George') (='Bush')

would match the literal token sequence "George Bush". An optional tilde (~) after the = makes the match case-insensitive, so for example

(=~'iphone')

will match any capitalization of the word ('iphone', 'iPhone', 'IPHONE', etc.).

NOTE: The use of the tilde (~) operator in pattern files indicates case insensitivity. This differs from its use in other data directory files.
In query syntax for query-defined topics, the tilde (~) operator enforces case sensitivity, not case insensitivity.
In HSD files for phrase-based sentiment, the tilde blacklists a phrase so it is not considered in sentiment calculations.
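As a rough Python sketch of the two literal forms in pattern files: the helper name literal is hypothetical, and using str.lower() for case folding is an assumption, since the exact folding rules Salience applies are not described here.

```python
def literal(text, case_insensitive=False):
    """Build a predicate for (='text') or, when case_insensitive, (=~'text')."""
    if case_insensitive:
        folded = text.lower()  # assumption: simple lowercase folding
        return lambda token: token.lower() == folded
    return lambda token: token == text

exact = literal("iphone")                            # (='iphone')
relaxed = literal("iphone", case_insensitive=True)   # (=~'iphone')

for token in ["iphone", "iPhone", "IPHONE"]:
    print(token, exact(token), relaxed(token))
```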

Wildcards

A wildcard, denoted by *, will match any token. The pattern

(='University') (='of') *

will match things like "University of Massachusetts", "University of Toronto", etc.

Hash Predicates

Patterns can use word dictionaries to match against tokens. The syntax is (#filename), where filename is a path to a file, relative to the directory where the pattern file was located. For example, in the rules.ptn file in the data/salience/entities/people directory, a pattern such as

(#prefixes.dat) (NNP)(NNPS)+

would match any span of tokens that starts with one of the tokens listed in data/salience/entities/people/prefixes.dat followed by one or more proper nouns. prefixes.dat contains words like Mr., Mrs., Dr., etc., so this pattern would match phrases like Mr. Bush.

Just like with literal predicates, a tilde (~) can be added to perform case-insensitive dictionary matching. So using (#~prefixes.dat) in the previous example would allow any variant on the capitalization, such as mr. or MRS.
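The path resolution and dictionary lookup described above might be modeled as follows. This is a sketch under assumptions: the helper names are hypothetical, and the one-entry-per-line file layout is assumed rather than taken from this document.

```python
from pathlib import Path

def resolve_dictionary(pattern_file, filename):
    """Resolve a (#filename) reference relative to the pattern file's directory."""
    return Path(pattern_file).parent / filename

def load_entries(lines, case_insensitive=False):
    """Assumed layout: one token per line; blank lines are skipped."""
    entries = {line.strip() for line in lines if line.strip()}
    return {e.lower() for e in entries} if case_insensitive else entries

path = resolve_dictionary("data/salience/entities/people/rules.ptn", "prefixes.dat")
prefixes = load_entries(["Mr.", "Mrs.", "Dr."], case_insensitive=True)  # like (#~prefixes.dat)

print(path.as_posix())
print("MRS.".lower() in prefixes)
```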

Phrase dictionaries

To match against a dictionary containing phrases of multiple tokens instead of just individual tokens, enclose a list of dictionaries in square brackets. For example, in the rules.ptn file in the data/salience/entities/companies directory, a pattern such as

[companies.dat] (=',') (#suffixes.dat)+

would match any span of tokens that starts with one of the phrases listed in companies.dat, followed by a comma and one or more of the tokens listed in suffixes.dat. If companies.dat contains International Business Machines and suffixes.dat contains Inc. and Co., then this pattern would match International Business Machines, Inc. or International Business Machines, Co.

Just like with hash predicates, a tilde (~) can be added to perform case-insensitive dictionary matching. So using [~companies.dat] in the previous example would allow any variant on the capitalization such as international business machines or iNtErNaTiOnAl BuSiNeSs MaChInEs.
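Unlike a hash predicate, a phrase dictionary entry can consume several tokens at once. A minimal Python sketch of that idea follows; the function name is hypothetical, and preferring the longest matching phrase is an assumption, not documented Salience behavior.

```python
def match_phrase(tokens, start, phrases, case_insensitive=False):
    """Return how many tokens the longest matching phrase consumes at start, or 0."""
    fold = (lambda s: s.lower()) if case_insensitive else (lambda s: s)
    best = 0
    for phrase in phrases:
        words = [fold(w) for w in phrase.split()]
        window = [fold(t) for t in tokens[start:start + len(words)]]
        if words == window and len(words) > best:
            best = len(words)
    return best

companies = ["International Business Machines"]
tokens = ["international", "business", "machines", ",", "Inc."]

print(match_phrase(tokens, 0, companies))                         # like [companies.dat]
print(match_phrase(tokens, 0, companies, case_insensitive=True))  # like [~companies.dat]
```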

Suffix Predicates

These allow matching against the ends of words using (%filename.dat). A pattern like:

(%nameendings.dat)

will match against any token that ends in one of the strings from the file nameendings.dat (which in the default distribution contains the most common last three letters of surnames). Suffix matching is currently always case-insensitive.
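In Python terms, a suffix predicate behaves roughly like a case-insensitive endswith test. The suffix list below is invented for illustration and is not the content of the shipped nameendings.dat.

```python
def ends_with_any(token, suffixes):
    """Case-insensitive check that the token ends in one of the suffixes."""
    lowered = token.lower()
    return any(lowered.endswith(s.lower()) for s in suffixes)

name_endings = ["son", "ski", "ton"]  # hypothetical entries

for token in ["Johnson", "KOWALSKI", "Bush"]:
    print(token, ends_with_any(token, name_endings))
```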

Part-of-speech Predicates

The part-of-speech tag of a token can be matched against by enclosing the tag in parentheses. The predicate (NNP), for example, would match tokens that are marked as proper nouns. A full list of supported part-of-speech tags can be found on the following page:

Supported part-of-speech tags

Regex Predicates

Regular expressions can be used for more flexible matching against the text of a token, by specifying a predicate of the form (regex=...) or (regex='...'). The quoted variant must be used if the regex contains an embedded close parenthesis. Note that in the quoted variant, the regex still cannot contain the embedded string '), but the same effect can be achieved by putting the quote in a character class, as in [']). Regexes are matched using the Boost regular expression library, using the default Perl-style syntax options.

For example, the following pattern

(regex='[0-9]{4}\-[0-9]{4}\-[0-9]{4}\-[0-9]{4}')

matches credit card numbers of the form ####-####-####-####.
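The same expression can be tried out in Python, whose re module accepts this particular regex unchanged. Note two assumptions: Salience uses Boost rather than Python's engine, and whether a regex predicate must cover the whole token or only part of it is not stated here, so the sketch uses a full-token match.

```python
import re

CARD = re.compile(r"[0-9]{4}\-[0-9]{4}\-[0-9]{4}\-[0-9]{4}")

def is_card_token(token):
    """True if the entire token has the form ####-####-####-####."""
    return CARD.fullmatch(token) is not None

for token in ["1234-5678-9012-3456", "1234-5678-9012", "1234 5678 9012 3456"]:
    print(token, is_card_token(token))
```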

Anchors

A ^ at the beginning of the pattern forces the pattern to match starting at the beginning of the text being matched against (in most cases where patterns are used they are matched by sentence, so this would anchor the pattern to the beginning of a sentence). Similarly, a $ at the end of the pattern would anchor it to the end of a sentence (or whatever text was being matched against at the time).

Conjunction

Multiple predicates can be combined into one pattern part, either ORed or ANDed together. If multiple predicates are adjoined with nothing in between, they are ORed together. For example, the pattern part (NNP)(NNPS)(='foo') would match any token that is either tagged as a proper noun or is the literal word "foo".

To AND together multiple predicates, separate them with an ampersand (&). The ampersand should have no whitespace around it. For example, (NNP)(NNPS)&(='foo') would match the literal token "foo", but only when it is tagged as a proper noun. Note that conjunction binds less tightly than disjunction, so the preceding pattern means something like "(tagged NNP or tagged NNPS) and equals 'foo'" instead of "tagged NNP or (tagged NNPS and equals foo)".

Negation

An ! can be used at the start of a disjunction of predicates to indicate that the part should match the inverse of those specified by the predicate. For example !(NNP)(NNPS) matches any token that is part-of-speech tagged as neither NNP nor NNPS. Note that negation binds tighter than conjunction but not as tightly as disjunction. So !(NNP)(NNPS)&(='foo') matches tokens that are the literal foo but tagged as neither NNP nor NNPS.

NOTE: The use of the exclamation (!) operator in pattern files indicates negation of a pattern match. This differs from its use in other data directory files.
In query syntax for query-defined topics, the exclamation (!) operator prevents a stemmed query match.
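The binding order in pattern files (disjunction tightest, then negation, then conjunction) can be made concrete with a small Python evaluator. It is a sketch of the precedence rules only; the data layout and helper names are assumptions.

```python
def pos(tag_name):
    """Predicate for a part-of-speech tag, e.g. (NNP)."""
    return lambda token, tag: tag == tag_name

def lit(text):
    """Predicate for a case-sensitive literal, e.g. (='foo')."""
    return lambda token, tag: token == text

def part_matches(token, tag, conjunct):
    """conjunct is a list of (negated, [predicates]) pairs that are ANDed.
    Predicates inside one pair are ORed first, then the optional ! is applied."""
    return all(
        any(p(token, tag) for p in disjunct) != negated
        for negated, disjunct in conjunct
    )

# !(NNP)(NNPS)&(='foo')
not_proper_and_foo = [(True, [pos("NNP"), pos("NNPS")]), (False, [lit("foo")])]

print(part_matches("foo", "NN", not_proper_and_foo))
print(part_matches("foo", "NNP", not_proper_and_foo))
print(part_matches("bar", "NN", not_proper_and_foo))
```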

Grouping

Multiple pattern parts can be grouped together using parentheses. This allows multiple parts to be included in quantifiers.

Quantifiers

Pattern parts or groups of parts can be appended with any of the three traditional regex quantifiers (?, *, +) to represent optional or repeatable parts of the pattern. As usual, ? marks a group that can match 0 or 1 times, * marks a group that can match 0 or more times and + marks a group that can match 1 or more times. For example, to match sequences of hyphenated words such as run-of-the-mill, the following pattern could be used:

(* (='-'))+ *
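One way to see what this pattern accepts is to encode each token as a single character and reuse an ordinary regex engine. The Python sketch below does that; the encoding trick and function names are assumptions made for illustration and say nothing about how Salience evaluates quantifiers internally.

```python
import re

def encode(tokens):
    """Map each token to one character: 'H' for a hyphen, 'W' for anything else."""
    return "".join("H" if t == "-" else "W" for t in tokens)

# (* (='-'))+ *  ->  one or more (any token, hyphen) pairs, then any token
HYPHENATED = re.compile(r"(?:.H)+.")

def find_span(tokens):
    """Return (start, end) token indexes of the first match, or None."""
    m = HYPHENATED.search(encode(tokens))
    return m.span() if m else None

tokens = ["a", "run", "-", "of", "-", "the", "-", "mill", "day"]
span = find_span(tokens)
print(span, tokens[span[0]:span[1]])
```

Here the matched span covers the tokens from run through mill.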

Postfixed predicates

Any pattern part can be prefixed with + or - to indicate that the matched token either must or must not, respectively, have immediately followed the preceding token in the original text, with no intervening whitespace. For example, to match ... McDonalds' ... but not ... McDonalds '..., the following pattern could be used:

(='McDonalds') +(=''')
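The + and - prefixes depend on where tokens sat in the original text. A hedged Python sketch follows, assuming each token carries its starting character offset (the tuple layout is hypothetical):

```python
def immediately_follows(prev, cur):
    """Tokens are (text, start_offset) pairs; True if cur starts where prev ends."""
    prev_text, prev_start = prev
    _, cur_start = cur
    return prev_start + len(prev_text) == cur_start

attached = [("McDonalds", 0), ("'", 9)]    # from the text "McDonalds' ..."
detached = [("McDonalds", 0), ("'", 10)]   # from the text "McDonalds ' ..."

print(immediately_follows(*attached))  # the + prefix would accept this pair
print(immediately_follows(*detached))  # the - prefix would accept this pair
```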

Pre- and Post-match Parts

Any pattern parts preceding the < character are considered pre-match parts, while any parts following the > character are considered post-match parts. Pre- and post-match parts restrict where patterns can match, but are not actually included in the match. For example:

<(='Cisco')> !(='Kid')

would match occurrences of the literal word Cisco, but only when the word that follows it is not Kid. Note, however, that this pattern won't match Cisco at the end of a sentence, since it requires that Cisco be followed by some word that is not Kid. To get around this, two patterns can be used, as in:

<(='Cisco')> !(='Kid')
(='Cisco')$

Together, these match Cisco when it is followed by a word other than Kid, or when it occurs at the end of a sentence.
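The combined effect of the two patterns can be sketched in Python over a tokenized sentence. This is a toy model of the behavior described above, with a hypothetical function name; only the index of Cisco is reported, since the post-match part is not included in the match.

```python
def cisco_matches(tokens):
    """Return indexes of 'Cisco' tokens accepted by either pattern."""
    hits = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        if tok != "Cisco":
            continue
        followed_by_other = i < last and tokens[i + 1] != "Kid"  # <(='Cisco')> !(='Kid')
        at_end = i == last                                       # (='Cisco')$
        if followed_by_other or at_end:
            hits.append(i)
    return hits

print(cisco_matches(["Cisco", "shares", "rose"]))
print(cisco_matches(["The", "Cisco", "Kid"]))
print(cisco_matches(["I", "like", "Cisco"]))
```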

Maximum Length

Pattern parts that are followed by a / and a number specify a maximum length for the token to match. Note that the length is specified in bytes, which for UTF-8 sequences is not necessarily the same as the number of characters. For example,

(NNP)(NNPS)/2

matches proper nouns that are at most 2 bytes long.
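The byte-versus-character distinction is easy to check in Python. The helper below is a hypothetical illustration of the length test, not Salience code.

```python
def within_max_bytes(token, limit):
    """True if the token's UTF-8 encoding is at most limit bytes long."""
    return len(token.encode("utf-8")) <= limit

print(within_max_bytes("US", 2))       # 2 characters, 2 bytes
print(len("Zürich"), len("Zürich".encode("utf-8")))
print(within_max_bytes("Zürich", 6))   # 6 characters, but ü takes 2 bytes
```

A token with non-ASCII characters can therefore exceed a limit that its character count alone would satisfy.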

Named Captures

Parts of the text matched by a pattern can be captured by enclosing them in (?<name> ... ). This is primarily useful in combination with the mechanism for manually scoring entities. For example, in data/companies/score.dat, a line such as:

(?<mycapture> *+) (=~'inc')(=~'inc.') normalized=mycapture

could be used to cut off Inc. from the end of any company names for the normalized form. This will cause e.g. Google Inc. to be normalized to just Google.
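Python's re module has a similar named-group feature, which gives a rough analogue of this normalization. The regex below works on a space-separated string rather than on tokens, and the token-boundary lookahead is an addition for the sketch, so treat it as an approximation only.

```python
import re

# Rough analogue of: (?<mycapture> *+) (=~'inc')(=~'inc.')
SUFFIX = re.compile(r"(?P<mycapture>\S+(?: \S+)*) (?:inc\.|inc)(?!\S)", re.IGNORECASE)

def normalize(name):
    """Return the captured part if the name ends in Inc or Inc., else the name itself."""
    m = SUFFIX.match(name)
    return m.group("mycapture") if m else name

for name in ["Google Inc.", "Acme Widgets inc", "Google Incorporated"]:
    print(name, "->", normalize(name))
```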

Pattern Order

In general, each pattern is processed in the order it occurs in the pattern file, and later patterns cannot include tokens matched by earlier patterns. This is because a token can only be part of one entity or theme. Because of this, if one pattern is a subset of another, you should always put the longer pattern first in your file. As an example, the following is a pattern from the themes/rules.ptn file: (JJ) (NN)(NNS)+. This matches an adjective followed by one or more nouns or plural nouns. If you wanted to restrict this to 2 word phrases, you could remove the plus. If you want to restrict it to 2 or 3 word phrases you could use the following:

(JJ) (NN)(NNS) (NN)(NNS)

(JJ) (NN)(NNS)

If you reversed the order of those patterns, the 2 word subphrase would always be found first and marked as a theme, leaving the 3 word phrase as invalid.
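The effect of ordering can be reproduced with a toy Python matcher in which a token claimed by one pattern is unavailable to later ones. The data layout and the left-to-right scan are assumptions for the sketch, not a description of the engine.

```python
def apply_patterns(patterns, tags):
    """Apply patterns in file order; a token claimed by a match cannot be reused."""
    claimed = [False] * len(tags)
    spans = []
    for pattern in patterns:
        n = len(pattern)
        for start in range(len(tags) - n + 1):
            window = range(start, start + n)
            if any(claimed[i] for i in window):
                continue
            if all(tags[i] in pattern[k] for k, i in enumerate(window)):
                spans.append((start, start + n))
                for i in window:
                    claimed[i] = True
    return spans

three = [{"JJ"}, {"NN", "NNS"}, {"NN", "NNS"}]   # (JJ) (NN)(NNS) (NN)(NNS)
two = [{"JJ"}, {"NN", "NNS"}]                    # (JJ) (NN)(NNS)
tags = ["JJ", "NN", "NNS"]                       # hypothetical three-token phrase

print(apply_patterns([three, two], tags))  # longer pattern first
print(apply_patterns([two, three], tags))  # shorter pattern first
```

With the longer pattern first the whole three-token phrase is found; with the order reversed only the two-token subphrase is.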

Alternative Literal Match Syntax

Certain uses, particularly patterns, require large groups of literal text matches, and repeatedly using (=~'text') can be slow and difficult to read. For these situations an alternative syntax exists. You will still need to use the basic syntax when creating case-sensitive patterns, or matching punctuation.

Text expressed outside of any other pattern is now considered an insensitive literal match. For example:

this is good news 

Will match the string 'this is good news'.

You can use the +, * and ? operators directly on tokens:

this is very* good news

Will match the previous text, as well as 'this is very very good news'.

Finally, the previous way of communicating a disjunction (putting two predicates next to each other without a space) does not work with the new syntax, as putting two tokens next to each other creates a run-on word. The | symbol is used instead for disjunctions, as in:

this is very* good|great|wonderful news

Which will also match 'this is very great news'.

Parentheses can be put around a disjunct list:

this is (very|extremely)* good news

but should not be put around an individual token, as that creates the POS tag syntax.
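For readers more familiar with ordinary regular expressions, the pattern this is very* good|great|wonderful news corresponds roughly to the hand-translated Python regex below. The translation is an approximation over a space-separated string, not something Salience produces.

```python
import re

# Hand translation of: this is very* good|great|wonderful news
ALT = re.compile(r"this is (?:very )*(?:good|great|wonderful) news", re.IGNORECASE)

def matches(text):
    """True if the whole text fits the translated pattern (case-insensitive)."""
    return ALT.fullmatch(text) is not None

for text in ["this is good news", "This is very very great news", "this is bad news"]:
    print(text, matches(text))
```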

Macros

If you find yourself creating linguistically detailed patterns (usually for relationships), macros are a way of reusing common subpatterns. Use the syntax @macroname pattern to define a macro, and then use (@macroname) in a pattern to reference it. If you want to use the same macros in multiple files, put them all in one file and then use @import filename to include them in other files. The macro file and the pattern file using it must exist in the same directory.

Macros can also take arguments, for example: @list(member=(NNP)) (member) ((=',') (member))+ (=',')? (=',')(=~'and') (member) matches a list of something, proper nouns by default. If you used (@list(VB)) in a pattern, you'd instead match a list of verbs.
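Conceptually, a macro argument with a default can be thought of as a substitution into the macro body. The Python sketch below models that with plain text replacement; this is an assumption for illustration, since how Salience actually expands macros is not described here, and the function name is hypothetical.

```python
def expand(body, param, default, arg=None):
    """Substitute every (param) placeholder with the argument, or the default."""
    return body.replace(f"({param})", arg if arg is not None else default)

body = "(member) ((=',') (member))+ (=',')? (=',')(=~'and') (member)"

print(expand(body, "member", "(NNP)"))          # like (@list)
print(expand(body, "member", "(NNP)", "(VB)"))  # like (@list(VB))
```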

Extracted named entities

Using entity types and labels within patterns
Within certain pattern files used to support Salience functionality, you will see pattern components that are specific to using the existence of named entity types within patterns. For example:

{Company@1} (='(') {Place@2} (=')')

The identifiers Company and Place are used to signify that the pattern requires a Company entity, followed by an opening parenthesis, then a Place entity, and a closing parenthesis. In the out-of-the-box pattern files, you will see the standard entity types of Company, Person, Place, Product used in patterns, but any entity type or entity label can be used for the identifier.

Note: You can also exclude an entity type match using the following syntax: {!Person}. This would match anything other than a Person entity.

Index number for reporting order
In relationship patterns, you will see an entity type followed by an index number. In the example above, the relationship results would report the Company entity first, and the Place entity second. To take this example a step further, let's consider the following two patterns which capture Person entities that may be a Plaintiff and Defendant in a court case but expressed in two different orders:

{Person@1} (='sues') {Person@2}

{Person@2} (='was sued by') {Person@1}

In both cases, the use of the index number ensures that the Plaintiff is reported first in the relationship results. The index number for each entity must be unique and increment starting at 1.
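The reporting rule amounts to sorting the matched entities by their index number rather than by their position in the text. A minimal Python sketch follows, with invented names and a hypothetical (text, index) layout:

```python
def report(matches):
    """matches holds (entity_text, index) pairs in text order; report by index."""
    return [text for text, index in sorted(matches, key=lambda m: m[1])]

# "Smith sues Jones": the Plaintiff appears first and carries index 1
active = [("Smith", 1), ("Jones", 2)]
# "Jones was sued by Smith": the Plaintiff appears second but still carries index 1
passive = [("Jones", 2), ("Smith", 1)]

print(report(active))
print(report(passive))
```

Both word orders yield the Plaintiff first.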

Additional qualifiers for named entity types
There are additional qualifiers that can be added to each named entity type or label in a pattern to control the pattern match. These are:

tokens=N
: States that the entity mention must be N tokens long

start or end
: States that the entity mention must occur as the first or last token in the sentence

next of
: States a requirement based on the next entity of the given type that is detected in the sentence. For example:

{Person@1} ** {Company@2,next of Company} means that after a Person entity, only the next Company can be matched.

{[email protected]} ** {[email protected], next of Place} would match any Company appearing before the first subsequent Place.
