Query Grammar

Reference and examples for the query grammar rules used by Salience.

Important Notes on Queries

  • Queries are made up of terms, phrases, and operators.
  • Queries cannot be empty.
  • Default maximum query length: 1,500 characters (contact us to adjust this).
  • Queries may be nested and chained together using multiple operators.
  • Queries may be referenced from another query using ^.
  • The query validation checklist can help if you encounter a hard-to-find issue with a query.

Overview

The query grammar used in Salience is made up of a few basic components, which can be combined to form more complex queries. These components are:

Operators: An extended set of boolean operators and their compositions that control the logic of a query

Term: A single word for which text data is queried. A term cannot be a stopword or an operator unless it is wrapped in quotes

  • Example: pizza

Phrase: Multiple words enclosed in quotes for which text data is queried

  • Example: "#1 pizza w/ breadsticks combo"

The basic query syntax is as follows:

( term | phrase ) OPERATOR ( term | phrase )

Example:

"popular toppings" AND pizza

Stringing together queries with multiple operators is allowed:

( term | phrase ) OPERATOR ( term | phrase ) OPERATOR ( term | phrase ) ...

Example:

cars AND "japanese manufacturer" NOT toyota

Nesting queries is allowed:

( query | term | phrase ) OPERATOR ( query | term | phrase ) ...

Example:

wine AND ( "dinner entrees" NEAR italian ) NOT ravioli
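
To make these building blocks concrete, the following Python sketch splits a query string into terms, phrases, operators, and parentheses. It is an illustration only, not Salience's parser: the `tokenize` function and its regular expression are assumptions made for this example, and it ignores stemming, wildcards, and prefix characters.

```python
import re

# Operator names as listed in this document.
OPERATORS = {"OR", "AND", "NOT", "NEAR", "NOTNEAR", "WITH", "NOTWITH", "EXCLUDE"}

# A quoted phrase (with backslash escapes), a single parenthesis, or a bare word.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[()]|[^\s()"]+')

def tokenize(query):
    """Classify each piece of a query as paren, phrase, operator, or term."""
    result = []
    for tok in TOKEN.findall(query):
        if tok in ("(", ")"):
            kind = "paren"
        elif tok.startswith('"'):
            kind = "phrase"
        elif tok.split("/")[0] in OPERATORS:   # NEAR/5 counts as NEAR
            kind = "operator"
        else:
            kind = "term"
        result.append((kind, tok))
    return result

for kind, text in tokenize('wine AND ( "dinner entrees" NEAR italian ) NOT ravioli'):
    print(kind, text)
```

Because operators are recognized only in uppercase, a lowercase `and` comes out of this sketch as an ordinary term.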

Operators

Important Notes on Operators

  • Operators must be all uppercase in order to be recognized.
  • Operators must be preceded by a term or phrase and followed by a term or phrase.
  • The available operators are: OR, AND, NOT, NEAR, NOTNEAR, WITH, NOTWITH, EXCLUDE.

OR Operator: Inside a query, the OR operator may be used to retrieve documents containing either or both of two terms:

Example Query: onions OR cheese

Hit (True):

  • "Onions make my eyes water"
  • "My favorite cheese is cheddar."
  • "I want cheese and onions on my pizza."

Miss (False):

  • "I like cheddar and ham sandwiches."
  • "Do you want pepper on your pizza?"

AND Operator: Inside a query, the AND operator may be used to retrieve documents containing both specified terms:

Example Query: onions AND cheese

Hit (True):

  • "I want cheese, onions and ham on my pizza"

Miss (False):

  • "Onions make my eyes water"
  • "My favorite cheese is cheddar."

NEAR Operator: The NEAR operator is effectively an AND operator with a limit on the distance between the two terms. As with AND, both terms must be in the document, but NEAR also requires them to occur within a given number of words of each other. You can set the distance by adding a / and a number after the operator, such as NEAR/50. The number can be between 0 and 99. The default distance is 10 words, so NEAR is the same as NEAR/10.

NOTE: Salience releases prior to Salience 5.2 do not allow NEAR/0.
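
One way to picture the distance check is as a comparison of word positions. The Python sketch below is only an approximation: the `near` helper is hypothetical, it matches exact lowercase words with no stemming, and it assumes the distance is the difference between word positions, which may not be exactly how Salience counts.

```python
import re

def words(text):
    """Lowercase word tokens; a crude stand-in for Salience tokenization."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(doc, left, right, distance=10):
    """True if some `left` and some `right` are at most `distance` positions apart."""
    toks = words(doc)
    left_pos = [i for i, t in enumerate(toks) if t == left]
    right_pos = [i for i, t in enumerate(toks) if t == right]
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)

# Default window of 10 words, as with a bare NEAR.
print(near("Do you want onions on top of your cheese pizza?", "onions", "cheese"))  # True
# A narrower window, as with NEAR/3.
print(near("I want onions on one half and extra cheese on the other",
           "onions", "cheese", distance=3))  # False
```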

Example Query: onions NEAR cheese

Hit (True):

  • "Do you want onions on top of your cheese pizza?"

Miss (False):

  • "Their cheese is my favorite, but it does not come on the dish with caramelized onions."

Example Query: (onions OR bananas) NEAR/5 (cheese OR dinner)

Hit (True):

  • "The banana split was included with dinner"
  • "The steak dinner with onions was my favorite."

Miss (False):

  • "The cheese platter on the dinner menu was superb."
  • "Bananas, strawberries, and ice cream are not a balanced dinner."

WITH Operator: The WITH operator requires that the two terms occur within the same sentence. Punctuation in your documents affects this operator, so it is not ideal for content where punctuation is often misused. In the example below, the difference between a comma and a period decides whether the query hits.
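
The effect of punctuation can be shown with a naive sentence splitter. This Python sketch is an illustration under stated assumptions, not Salience's sentence detection: `split_sentences` and `with_op` are hypothetical helpers that break text after `.`, `!` or `?` and compare exact lowercase words.

```python
import re

def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def with_op(doc, left, right):
    """True if both words occur in at least one common sentence."""
    for sentence in split_sentences(doc):
        toks = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        if left in toks and right in toks:
            return True
    return False

# A comma keeps both terms in one sentence; a period separates them.
print(with_op("I have onions, would you like some on your cheese pizza?", "onions", "cheese"))  # True
print(with_op("I have onions. Would you like some on your cheese pizza?", "onions", "cheese"))  # False
```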

Example Query: onions WITH cheese

Hit (True):

  • "I have onions, would you like me to put some on your cheese pizza?"

Miss (False):

  • "I have onions. Would you like me to put some on your cheese pizza?"

NOT Operator: The NOT operator excludes any documents containing the term which follows it. A query must contain at least one non-excluded term when using the NOT operator.

Example Query: onions NOT celery

Hit (True):

  • "Onions make my eyes water"
  • "I like onions very much"

Miss (False):

  • "I like onions on my sandwich and celery on the side."

NOTWITH Operator: The NOTWITH operator requires that the two terms do not occur within the same sentence. As with the WITH operator, punctuation affects this operator, so it is not ideal for content that misuses punctuation.

NOTE: Operator added in Salience 6.2.

Example Query: onions NOTWITH cheese

Hit (True):

  • "I have onions. Would you like me to put some on your cheese pizza?"

Miss (False):

  • "I have onions, would you like me to put some on your cheese pizza?"

NOTNEAR Operator: The NOTNEAR operator is effectively a NOT operator with a limit on the distance between the two terms. As with NEAR, you can set the distance by adding a / and a number after the operator, such as NOTNEAR/50. The number can be between 0 and 99. The default distance is 10 words, so NOTNEAR is the same as NOTNEAR/10.

NOTE: Operator added in Salience 6.2.

Example Query: onions NOTNEAR/5 cheese

Hit (True):

  • "I have onions, would you like me to cut some up for on top of your cheese pizza?"

Miss (False):

  • "I want onions and cheese on my pizza."

EXCLUDE Operator: The EXCLUDE operator differs from the other operators in that it handles cases where the term preceding the operator is usually part of the phrase following the operator. The query returns documents that contain the term preceding EXCLUDE, except those in which that term occurs only as part of the phrase following EXCLUDE. The effect is different from that of the NOT operator.

Example Query: York EXCLUDE "New York"

Hit (True):

  • "I spent the day in York, visiting the magnificent cathedral. Then it was time to head back to London for my flight home to New York."

Miss (False):

  • "It was time to head back to London for my flight home to New York."

The first document contains “York” outside of the context of “New York”, so it hits on the query.

The second document also contains “York”, but only within “New York”, which is excluded, so the query does not hit.

If we use the same query terms but substitute NOT, we will get different results:

Example Query: York NOT "New York"

Hit (True):

  • "I spent the day in York, visiting the magnificent cathedral."

Miss (False):

  • "I spent the day in York, visiting the magnificent cathedral. Then it was time to head back to London for my flight home to New York."

With NOT, adding “New York” to the document makes the query false. The same document does hit if you use EXCLUDE instead.

Note that neither NOT nor EXCLUDE is constrained by sentence boundaries; both are document-wide operators.
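
The difference between the two operators can be approximated by counting occurrences. In this Python sketch (hypothetical helpers, exact case-insensitive word matching, no stemming), EXCLUDE hits when the term occurs more often than the phrase that contains it, while NOT hits only when the phrase is absent. It assumes each occurrence of the phrase contains the term exactly once, as "New York" contains "York".

```python
import re

def occurrences(text, doc):
    """Count whole-word, case-insensitive occurrences of a term or phrase."""
    pattern = r"\b" + re.escape(text) + r"\b"
    return len(re.findall(pattern, doc, flags=re.IGNORECASE))

def exclude_op(doc, term, phrase):
    # Hit when the term appears at least once outside the phrase.
    return occurrences(term, doc) > occurrences(phrase, doc)

def not_op(doc, term, phrase):
    # Hit when the term appears and the phrase does not appear at all.
    return occurrences(term, doc) > 0 and occurrences(phrase, doc) == 0

doc = ("I spent the day in York, visiting the magnificent cathedral. "
       "Then it was time to head back to London for my flight home to New York.")
print(exclude_op(doc, "York", "New York"))  # True
print(not_op(doc, "York", "New York"))      # False
```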

Parentheses

Queries can use parentheses to control the logic of the query:

  ((onions OR cheese) AND celery) NOT horrible
  (onions OR cheese) NEAR (horrible OR disgusting)

Notes on use of parentheses:

  1. Every left parenthesis must have a corresponding right parenthesis.
  2. You can nest parentheses up to 10 levels deep.
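
Both rules can be checked before a query is submitted. The Python sketch below uses a hypothetical `check_parentheses` helper; treating parentheses inside quoted phrases and after a backslash as literal text is an assumption of this example rather than documented behavior.

```python
def check_parentheses(query, max_depth=10):
    """True if parentheses are balanced and nested no deeper than max_depth."""
    depth = deepest = 0
    in_quotes = False
    escaped = False
    for ch in query:
        if escaped:
            escaped = False            # character after a backslash is literal
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            pass                       # parentheses inside a phrase are literal
        elif ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:              # right parenthesis with no matching left
                return False
    return depth == 0 and deepest <= max_depth

print(check_parentheses("((onions OR cheese) AND celery) NOT horrible"))  # True
print(check_parentheses("(onions OR cheese NEAR (horrible OR disgusting)"))  # False
```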

Subqueries

Queries can be referenced by other queries using a special character (^):

        Subquery1 "Concept Matrix" OR "Concept Matrices"
        Subquery2 Collections AND Facets AND "Additional Language Support"
        MainQuery1  Salience AND "5.0" OR ^Subquery1
        MainQuery2  ^Subquery1 NEAR/10 ^Subquery2

Notes on use of subqueries:

  1. A query must be defined above another query that uses it as a subquery.
  2. A query that is to be referenced by another query cannot contain whitespace in the query label (e.g. Subquery1, not Subquery 1)
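
The ordering rule can be verified mechanically. This Python sketch assumes the query file has already been read into (label, query) pairs in file order; `undefined_references` is a hypothetical helper, and it does not try to skip a ^ that appears inside a quoted phrase.

```python
import re

def undefined_references(queries):
    """Return (label, reference) pairs where ^reference is not defined earlier."""
    defined = set()
    problems = []
    for label, query in queries:
        for ref in re.findall(r"\^([^\s()]+)", query):
            if ref not in defined:
                problems.append((label, ref))
        defined.add(label)
    return problems

queries = [
    ("Subquery1", '"Concept Matrix" OR "Concept Matrices"'),
    ("Subquery2", 'Collections AND Facets AND "Additional Language Support"'),
    ("MainQuery1", 'Salience AND "5.0" OR ^Subquery1'),
    ("MainQuery2", "^Subquery1 NEAR/10 ^Subquery2"),
]
print(undefined_references(queries))        # []
print(undefined_references(queries[::-1]))  # every reference now precedes its definition
```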

Metadata criteria

Curly braces can be used to reference either NLP features (such as document sentiment) or, starting in Salience 7, metadata provided along with the document (via AddSection).

To query on NLP features, the syntax is:

{<entity|document> <entity type> : <sentiment criteria>}

The syntax above allows for the first component of the metadata criteria to be either entity or document.

If the first component is entity, it may be followed by an entity_type. This may be any of the entity types supported by the Salience entity extraction model, such as company, person, place, or product. User and Named Entities are both matched.

Optionally, a sentiment criteria component may be added. Sentiment criteria can be a comparison of document or entity sentiment to a single value, or a range.

Based on these specifications, the following metadata query phrases are valid:

"merger announcement" NEAR/5 {entity company}

"merger announcement" AND {document: sentiment > 0.2}

"merger announcement" AND {entity company: sentiment > 0.2}

"merger announcement" AND {entity company: 0 < sentiment < 0.25}

As of Salience 7, you can also write queries against the sections of your document.

lxaAddSection(name, value) includes a named field with the document. This field can then be referenced as follows:

{section NPS > 7}

This checks whether a document section named "NPS" exists and, if so, whether its value is above 7.

The section field can include any boolean statement using the following operators and functions:

Operator/Function : Symbol

  • Simple arithmetic operators : + - * / %
  • Logical AND and OR : &&, AND, ||, OR
  • Logical NOT : && !, NOT
  • Bitwise AND and OR : &, |
  • Comparisons : ==, <=, >=, <, >, !=
  • Type casting : int(X), float(X), string(X)
  • Ternary if : X ? Y : Z
  • Casefolding : to_lower(X) or casefold(X)
  • Substring : X in Y
  • Default value : default(variable, value_if_not_defined)

There is also a built in variable: document_length (in tokens).

The section node evaluates to a boolean value, so it can be part of AND/OR/NOT queries.

By default, any string can be used as a variable name in a section field, and its value will be interpreted as a string (which can be cast to a number like so: int(variable)). If you have a known set of section fields, you can also use a document_model.dat file in the root of your user directory to specify the types of fields Salience can expect to see.

The format of that file is:
<variable><tab>(int|float|string|any)?<tab>(true|false)?

The second column is the type of value that is valid for the section field, and the third column indicates whether the field must always be present or is optional.

If a document model is present, the following things will happen:

  1. lxaSetSalienceCallback can be set to provide warning messages if a document does not match document_model.dat, either because required variables are missing or because a value isn't of the correct type (e.g. passing "cat" into an int field).
  2. Warnings are also returned to the callback if a query references a variable that is not in the document model.
  3. The section fields are automatically interpreted as the specified type, so you don't have to cast to an integer/float to use the variable in a mathematical equation. E.g.:

With document model:
{section current_year - birth_year > 18}

Without document_model:
{section int(current_year) - int(birth_year) > 18}
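
To illustrate what a document model buys you, the Python sketch below parses text in the tab-separated format shown above and applies the declared types to a set of section values. It is not Salience code: `parse_document_model` and `check_sections` are hypothetical helpers, and the sketch assumes that an omitted type means any, that an omitted third column means optional, and that true marks a field as required.

```python
def parse_document_model(text):
    """Parse '<variable><tab>type<tab>flag' lines into {name: (type, required)}."""
    model = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        kind = (cols[1].strip() if len(cols) > 1 else "") or "any"      # assumed default
        required = len(cols) > 2 and cols[2].strip().lower() == "true"  # assumed meaning
        model[cols[0].strip()] = (kind, required)
    return model

CASTS = {"int": int, "float": float, "string": str, "any": lambda value: value}

def check_sections(model, sections):
    """Return (typed values, warnings) for one document's section values."""
    typed, warnings = {}, []
    for name, (kind, required) in model.items():
        if name not in sections:
            if required:
                warnings.append(f"missing required section: {name}")
            continue
        try:
            typed[name] = CASTS[kind](sections[name])
        except ValueError:
            warnings.append(f"{name}: {sections[name]!r} is not a valid {kind}")
    return typed, warnings

model = parse_document_model("birth_year\tint\ttrue\ncurrent_year\tint\ttrue\n")
typed, warnings = check_sections(model, {"birth_year": "1990", "current_year": "2024"})
print(typed["current_year"] - typed["birth_year"] > 18)  # True, no int() casts needed
print(check_sections(model, {"birth_year": "cat"})[1])   # two warnings
```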

NOTE: The NEAR and WITH operators assume usage with text-level elements. It is not valid to use the {document: sentiment} or {metadata} construction with these query operators.

  • "merger announcement" NEAR/5 {entity company} : Valid
  • "merger announcement" NEAR/5 {document: sentiment > 0.2} : Invalid

Query terms and phrases

Single query terms

Single query terms are the simplest query element, consisting of a single word:

`broccoli`

Notes on query terms:

  1. A single query term cannot be a word that appears in a stopword list or an operator, unless wrapped in quotes.
  2. A single query term cannot contain punctuation or other special characters, unless wrapped in quotes. The special characters are:

` ! @ # * $ % ^ ( ) _ = ~ + [ ] { } | " ' : ; . , < > ? / \

Phrase searches

Phrases may be enclosed in double quotes:

"broccoli cheese"

Notes on phrase searches:

  1. Double quotes must begin and end a query phrase.
  2. When a single word is enclosed in quotes, it is not treated as a phrase search. It is treated like a single word, as if it were not in quotes. (Ex. "broccoli" = broccoli)
  3. The special characters listed above as invalid for single query terms may be used within query phrases provided they are escaped using a \ character.
  4. NOTE: Salience 5.2 and above allow @ and # to be used within quoted query phrases without escaping.
  5. For multi-word phrase searches, only the right-most word is stemmed; the other words in the phrase are not. (Ex. "driving on faster roads" will match "driving on faster road-" but will not match "driving on fast- roads")

Wildcards

A wildcard character may be used at the end of a single word query term, or at the end of a phrase. For example:

  • excit* : This would match excite, exciting, excitement, etc.
  • "health agen*" : This would match "health agenda" or "health agency" or "health agencies".

Note that a wildcard query must have a prefix of at least three letters:

  • d* : Invalid
  • do* : Invalid
  • dog* : Valid
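
A trailing wildcard behaves like a prefix match. The Python sketch below turns a wildcard term or phrase into a regular expression and enforces the three-letter minimum. The `wildcard_to_regex` helper is hypothetical, and applying the minimum to the last word of a phrase is an assumption of this example.

```python
import re

def wildcard_to_regex(term):
    """Build a case-insensitive prefix pattern from a term or phrase ending in *."""
    term = term.strip('"')
    if not term.endswith("*"):
        raise ValueError("no trailing wildcard")
    prefix = term[:-1]
    parts = prefix.split()
    if not parts or len(parts[-1]) < 3:
        raise ValueError("wildcard needs at least a three-letter prefix")
    return re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)

print(bool(wildcard_to_regex("excit*").search("What an exciting day")))                # True
print(bool(wildcard_to_regex('"health agen*"').search("Two health agencies agreed")))  # True
```

Calling `wildcard_to_regex("do*")` raises `ValueError`, mirroring the Invalid entries in the list above.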

Negation

By default, a query on dog will hit on both normal and negated mentions: "No dogs allowed" is about dogs. In some instances, you may want to restrict a token to match only negated or only nonnegated forms. For example, if you want to find hotels with pools, then "There was no pool" isn't a match. Prefixing a term with a + restricts it to only nonnegated matches, while a - restricts it to only negated matches. For example:

+happy OR -sad

This would hit the sentences "I am happy" and "I haven't been sad for a while", but not "I'm not happy" or "I am sad."

The default is to match both negated and nonnegated forms if no + or - is specified. If you have a use case that requires mostly nonnegated matches, you can set the following option instead of modifying every query token: SALIENCEOPTION_QUERYDEFAULTACCEPTSNEGATED.

Case-sensitivity

By default, query terms are handled in a case-insensitive manner. Case-sensitivity on a query term can be enforced using the ~ operator.

Example Query: ~Google NEAR/10 Microsoft

Hit (True):

  • "Both tech giants Microsoft and Google are investing heavily in mobile technologies"

Miss (False):

  • "who wins in search, microsoft, bing or google?"

NOTE: The use of the tilde (~) operator in query syntax indicates case sensitivity. This differs from its use in other data directory files:

  • In pattern files, the tilde (~) operator enforces case insensitivity, not case sensitivity.
  • In HSD files for phrase-based sentiment, the tilde blacklists a phrase so it is not considered in sentiment calculations.
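
The effect of the ~ prefix on a single term can be sketched in a few lines of Python. The `term_matches` helper is hypothetical and compares whole words only; stemming and the other prefix characters are left out.

```python
import re

def term_matches(term, doc):
    """Match one query term against a document, honoring a leading ~."""
    case_sensitive = term.startswith("~")
    word = term[1:] if case_sensitive else term
    toks = re.findall(r"\w+", doc)
    if case_sensitive:
        return word in toks                        # exact case required
    return word.lower() in (t.lower() for t in toks)

doc = "who wins in search, microsoft, bing or google?"
print(term_matches("~Google", doc))    # False
print(term_matches("Microsoft", doc))  # True
```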

Stemming

By default, query terms are stemmed. As mentioned above, in the case of multi-word phrases, the last term in the phrase is the only term that is stemmed.

Stemming can be turned off for individual words via the stemwords.dat data file.

Stemming can be turned off for an entire query using the ! operator.

!driving AND faster AND roads

NOTE:

The use of the exclamation (!) operator in query topic definitions prevents stemming of query terms. This differs from its use in other data directory files.
In pattern files, the exclamation (!) operator negates a pattern match component.

Globally, stemming can be turned off using the API option SALIENCEOPTION_QUERYTOPICSTEMMING.

Accents/Diacritics

If you want Salience to respect accent/diacritic differences in queries and content, so that the query "mere" does not hit the content "mère", turn the Ignore Accents option off.

If this option is left on (the default behavior), queries without any diacritics will match content that contains them. This can be particularly useful for social media, where writers are often casual about the use of diacritics. If you include the accents in your query, Salience assumes you want them and will only match content with the same diacritics.
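
The asymmetry described above can be modeled with Unicode normalization. This Python sketch is an approximation for single words: `strip_accents` and `accent_match` are hypothetical helpers, and the `ignore_accents` parameter merely stands in for the Ignore Accents option.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_match(query, word, ignore_accents=True):
    """Compare a query word with a content word under the accent rules above."""
    if query == word:
        return True
    query_is_plain = strip_accents(query) == query
    if ignore_accents and query_is_plain:
        return strip_accents(word) == query   # plain query matches accented content
    return False                              # accented query needs the same accents

print(accent_match("mere", "mère"))                        # True
print(accent_match("mère", "mere"))                        # False
print(accent_match("mere", "mère", ignore_accents=False))  # False
```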

Hashtags/Mentions

Starting in Salience 6.2, hashtags and Twitter-style mentions are stemmed, e.g. #dogs is stemmed to #dog.

Additionally, starting in Salience 6.2, a query token will match hashtag and Twitter-style mention forms of itself. The query:

truck

will hit both @truck and #truck. Phrase searches do not have this behavior: querying on

"truck"

only hits the literal phrase truck, not #truck or @truck. Finally, querying for the hashtag or mention form only finds that form:

@truck

This will not match #truck or truck, only @truck.

Stopwords

Stopwords, or words that will not be considered for query terms, can be added or removed through the stopwords.dat data file. Using a stopword as a query term generates an error when the query is used.

POS tags

Queries against a token can reference only mentions with a specific POS tag using the "_" character, for example: cook_VB will only match "cook" being used as a verb.

Scoring

Query results are accompanied by two scores, Query Relevance and Query Sentiment.

Query Relevance: Query Relevance is a count of the query terms found within a document. It is particularly useful for judging how well your queries fit your text. Consider the following text:

"I have one cat and I used to have a dog too."

The Query Relevance score for the query cat OR dog OR bird will be 2 because two of the query terms are found in the text.
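
The count in this example can be reproduced with a short Python sketch. The `query_relevance` helper is hypothetical: it counts each query term at most once and uses exact lowercase words, whereas the text above does not say whether repeated mentions add to the score.

```python
import re

def query_relevance(doc, terms):
    """Count how many of the query terms occur in the document."""
    found = set(re.findall(r"[a-z0-9']+", doc.lower()))
    return sum(1 for term in terms if term.lower() in found)

text = "I have one cat and I used to have a dog too."
print(query_relevance(text, ["cat", "dog", "bird"]))  # 2
```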

Query Sentiment: Query Sentiment is computed by determining the sentiment for each matched query term separately, using model- and dictionary-driven approaches, and then averaging the scores across all mentioned terms.

Query validation checklist

Salience generates error messages on invalid query construction. To avoid many of the errors that can occur, use the following checklist to validate your queries.

  1. Query is in the format: query-title<tab>query-term(s)
  2. Query length less than 10000 characters
  3. Query cannot be empty
  4. Operators must be capitalized
  5. When specified, the window for a NEAR operator must be between 0 and 99
  6. Every operator must have a term on the left side and right side
  7. Parentheses and quotes must be balanced
  8. Query terms cannot contain invalid characters: ` ! @ # $ % ^ ( ) _ = ~ + [ ] { } ( ) | " ' : ; . , < > ? /
  9. The characters that are invalid for single query terms may be used in phrase terms, but must be escaped using \, with the exception of @ and #
  10. Query should not contain stopwords
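
Several of these checks are easy to automate before a query file is loaded. The Python sketch below covers only part of the checklist (line format, emptiness, length, quote and parenthesis counts, the NEAR window, and operators at either end of the query). The `lint_query_line` helper is hypothetical, its messages are not Salience's error messages, and the length limit is a parameter because the limit is adjustable.

```python
import re

OPERATORS = {"OR", "AND", "NOT", "NEAR", "NOTNEAR", "WITH", "NOTWITH", "EXCLUDE"}

def is_operator(word):
    return word.split("/")[0] in OPERATORS       # NEAR/5 counts as NEAR

def lint_query_line(line, max_length=1500):
    """Return a list of problems found in one 'query-title<tab>query' line."""
    label, tab, query = line.partition("\t")
    if not tab or not label.strip():
        return ["expected query-title<tab>query-term(s)"]
    query = query.strip()
    if not query:
        return ["query is empty"]
    problems = []
    if len(query) > max_length:
        problems.append("query is too long")
    bare = re.sub(r"\\.", "", query)              # drop escaped characters
    if bare.count('"') % 2:
        problems.append("unbalanced quotes")
    bare = re.sub(r'"[^"]*"', " PHRASE ", bare)   # hide phrase contents
    if bare.count("(") != bare.count(")"):        # count check only, not nesting
        problems.append("unbalanced parentheses")
    words = re.sub(r"[()]", " ", bare).split()
    for word in words:
        window = re.fullmatch(r"(?:NOT)?NEAR/(\d+)", word)
        if window and not 0 <= int(window.group(1)) <= 99:
            problems.append(f"window out of range in {word}")
    if words and (is_operator(words[0]) or is_operator(words[-1])):
        problems.append("operator is missing a term on one side")
    return problems

print(lint_query_line('Pizza\t"popular toppings" AND pizza'))  # []
print(lint_query_line("Bad\tonions NEAR/150 (cheese"))
```

Checks that depend on data files or on intent, such as stopwords and lowercase operators, are left out of this sketch.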