Ascribe Regular Expressions

Ascribe allows the use of regular expressions in many areas, including the codes in a codebook and various search operations.  The regular expressions in Ascribe include powerful features not found in standard regular expressions.  In this article we will explain regular expressions and the extensions available in Ascribe.

Regular Expressions

A regular expression is a pattern used to match text.  Regular expression patterns are themselves text, but with certain special features that give flexibility in text matching.

In most cases, a character in a regular expression simply matches that same character in the text being searched.

The regular expression

love

matches the same four characters in the sentence:

I loved the food and service.

We will write:

love ⇒ I loved the food and service.

to mean “the regular expression ‘love’ matches the underlined characters in this text”.

Here are a few other examples:

e ⇒ I loved the food and service.

d a ⇒ I loved the food and service.

b ⇒ Big Bad John.

Note from the last example that regular expressions in Ascribe are case insensitive.  The character ‘b’ matches ‘b’ or ‘B’, and the character ‘B’ also matches ‘b’ or ‘B’.
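
The examples above can be tried in most regex engines.  The sketch below uses Python's re module as a stand-in for Ascribe.  This is an assumption made for illustration: Ascribe does not run Python, and the re.IGNORECASE flag is added here only to imitate Ascribe's case-insensitive matching.  The helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return every substring of text matched by pattern, ignoring case."""
    return re.findall(pattern, text, flags=re.IGNORECASE)

sentence = "I loved the food and service."

print(matches("love", sentence))        # ['love']
print(matches("e", sentence))           # ['e', 'e', 'e', 'e']
print(matches("d a", sentence))         # ['d a']
print(matches("b", "Big Bad John."))    # ['B', 'B']
```

The pattern "e" matches four times because each occurrence in the text is a separate match.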

So far regular expressions don’t seem to do anything beyond plain text matching.  But not all characters in a regular expression simply match themselves.  Some characters are special.  We will run through some of the special characters you will use most often.

The . (dot or period) character

A dot matches any character:

e. ⇒ I loved the food and service.

To be precise, the dot matches any character except a newline character, but that will rarely matter to you in practice.
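
As a rough illustration of the dot, here is the same pattern in Python's re module (a stand-in engine, not Ascribe itself).  The second call shows the newline exception.

```python
import re

sentence = "I loved the food and service."

# Each match is an 'e' plus whatever single character follows it.
dot_matches = re.findall(r"e.", sentence, flags=re.IGNORECASE)
print(dot_matches)                      # ['ed', 'e ', 'er', 'e.']

# The 'e' here is followed by a newline, which the dot does not match.
newline_matches = re.findall(r"e.", "one\ntwo", flags=re.IGNORECASE)
print(newline_matches)                  # []
```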

The ? (question mark) character

A question mark following a normal character means that character is optional.  The pattern will match whether the character before the question mark is in the text or not.

dogs? ⇒ A dog is a great pet.  I love dogs!
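
A minimal check of the optional character, again using Python's re module as a stand-in for Ascribe:

```python
import re

text = "A dog is a great pet.  I love dogs!"

# The trailing 's' is optional, so both the singular and the plural match.
optional_matches = re.findall(r"dogs?", text, flags=re.IGNORECASE)
print(optional_matches)                 # ['dog', 'dogs']
```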

Character classes

You can make a pattern to match any of a set of characters using character classes.  You put the characters to match in square brackets:

cott[oeu]n ⇒ Cotton shirts, cotten pants, cottun blouse

Both the question mark and character classes are great when trying to match common spelling errors.  They can be particularly useful when creating regular expressions to match brand mentions:

budw[ei][ei]?ser ⇒ Budweiser, Budwieser, Budwiser, Budweser

Note that the question mark following a character class means that the character class is optional.
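
The misspelling patterns above can be checked with a short sketch.  Python's re module stands in for Ascribe here, with re.IGNORECASE assumed to mirror Ascribe's case-insensitive behavior.

```python
import re

# One character from the set [oeu] must appear between "cott" and "n".
cotton = re.findall(r"cott[oeu]n",
                    "Cotton shirts, cotten pants, cottun blouse",
                    flags=re.IGNORECASE)
print(cotton)       # ['Cotton', 'cotten', 'cottun']

# One 'e' or 'i' is required after "budw"; a second one is optional.
brands = re.findall(r"budw[ei][ei]?ser",
                    "Budweiser, Budwieser, Budwiser, Budweser",
                    flags=re.IGNORECASE)
print(brands)       # ['Budweiser', 'Budwieser', 'Budwiser', 'Budweser']
```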

Some character classes are used so often that there are shorthand notations for them.  These begin with a backslash character.

The \w character class matches any word character, meaning the letters of the alphabet, digits, and the _ (underscore) character.  It does not match hyphens, apostrophes, or other punctuation characters:

t\we ⇒ The tee time is 2:00.

The backslash character can also be used to change a special character such as ? into a normal character to match:

dogs\? ⇒ A dog is a great pet. Do you have any dogs?  I love dogs!
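
Both uses of the backslash can be seen in this sketch, which assumes Python's re module as a stand-in engine.  The patterns are written as raw strings (r"...") so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

# \w matches one word character between 't' and 'e'.
word_char = re.findall(r"t\we", "The tee time is 2:00.", flags=re.IGNORECASE)
print(word_char)    # ['The', 'tee']

# \? matches a literal question mark instead of making the 's' optional.
text = "A dog is a great pet. Do you have any dogs?  I love dogs!"
literal = re.findall(r"dogs\?", text, flags=re.IGNORECASE)
print(literal)      # ['dogs?']
```

Note that "time" is not matched by t\we, because the character after "ti" is "m", not "e".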

Quantifiers

The quantifiers * and + mean that the preceding expression can repeat.  The * means the expression can repeat zero or more times, and the + means the preceding expression can repeat one or more times.  So * means the preceding expression is optional, but if present it can repeat any number of times.  While occasionally useful following a normal character, quantifiers are more often used following character classes.  To match a whole word, we can write:

\w+ ⇒ I don’t have a cat.

Note that the apostrophe is not matched.
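
To make the difference between * and + concrete, here is a small sketch using Python's re module as a stand-in for Ascribe.  The sample text "gd god good" is made up for illustration.

```python
import re

sample = "gd god good"

# 'o*' allows zero or more 'o' characters, so "gd" matches too.
star = re.findall(r"go*d", sample, flags=re.IGNORECASE)
print(star)         # ['gd', 'god', 'good']

# 'o+' requires at least one 'o', so "gd" is skipped.
plus = re.findall(r"go+d", sample, flags=re.IGNORECASE)
print(plus)         # ['god', 'good']

# \w+ matches runs of word characters; the apostrophe splits "don't" in two.
words = re.findall(r"\w+", "I don't have a cat.", flags=re.IGNORECASE)
print(words)        # ['I', 'don', 't', 'have', 'a', 'cat']
```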

Anchors

Anchors don’t match any characters in the text, but match special locations in the text.  The most important of these for use with Ascribe is \b, which matches a word boundary.  This is useful for searching for whole words.

\bday\b ⇒ Today is the Winter solstice.  Each day gets longer, with more hours of daylight.

Combined with \w* we can find words that start with “day”:

\bday\w*\b ⇒ Today is the Winter solstice.  Each day gets longer, with more hours of daylight.

or words that contain “day”:

\b\w*day\w*\b ⇒ Today is the Winter solstice.  Each day gets longer, with more hours of daylight.

Two other useful anchors are ^ and $, which match the start and end of the text respectively.
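
The anchor examples can be reproduced with the following sketch, which assumes Python's re module as a stand-in engine.

```python
import re

text = ("Today is the Winter solstice.  Each day gets longer, "
        "with more hours of daylight.")

def find(pattern):
    """Return all case-insensitive matches of pattern in the sample text."""
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(find(r"\bday\b"))          # ['day']
print(find(r"\bday\w*\b"))       # ['day', 'daylight']
print(find(r"\b\w*day\w*\b"))    # ['Today', 'day', 'daylight']

# ^ and $ match only at the start and end of the text.
print(find(r"^today"))           # ['Today']
print(find(r"daylight\.$"))      # ['daylight.']
```

"Today" is not matched by \bday\w*\b because there is no word boundary between "To" and "day".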

Alternation

The vertical bar character | matches either the expression on its left or the expression on its right.  You can think of this as an OR operation.  Here is a pattern to match either the word “staff” or a word starting with “perso”:

\bstaff\b|\bperso\w+\b ⇒ The staff was friendly.  Your support personnel are great!
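
A quick check of the alternation pattern, using Python's re module as a stand-in for Ascribe:

```python
import re

text = "The staff was friendly.  Your support personnel are great!"

# Either side of the | may match; each match comes from one alternative.
either = re.findall(r"\bstaff\b|\bperso\w+\b", text, flags=re.IGNORECASE)
print(either)       # ['staff', 'personnel']
```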

Regular expression reference

This is just a glimpse into the subject of regular expressions.  There are books written on the subject, but to make effective use of regular expressions in Ascribe you don’t need to dig that deep.  You can find lots of information about regular expressions on the web.  Here is one nice tutorial site:

RegexOne

The definitive reference for the regular expressions supported in Ascribe is found here:

Regular Expression Language

You can also download a quick reference file:

Regular Expression Language – Quick Reference

Ascribe Extended Regular Expressions

Ascribe adds a few additional operators to standard regular expressions.  These are used to tidy up searching for whole words, and to add more powerful logical operations.

Word matching

As we have mentioned, you can use \b to find whole words:

\bstaff\b|\bperso\w+\b ⇒ The staff was friendly.  Your support personnel are great!

But those \b’s and \w’s make things hard to read.  Ascribe allows you to use these notations instead:

Notation    Standard regex equivalent    Meaning
<           (\b                          Start of word
>           \b)                          End of word
<<          (\b\w*                       Word containing …
>>          \w*\b)                       … to end of word

We can now write:

<as> (the word “as”)

<as>> (a word starting with “as”)

<<as> (a word ending with “as”)

<<as>> (a word containing “as”)

Examples:

<staff>|<perso>> ⇒ The staff was friendly.  Your support personnel are great!

<bud> ⇒ Bud, Bud light, Budweiser, Budwieser

<bud>> ⇒ Bud, Bud light, Budweiser, Budwieser

<<ser> ⇒ Bud, Bud light, Budweiser, Budwieser

<<w[ei]>> ⇒ Bud, Bud light, Budweiser, Budwieser

If you look at the table above, you will note that each of the standard regular expression equivalents has a parenthesis in it.  Yes, you can use parentheses in regular expressions, much as in algebraic formulas.  It’s an error in the regular expression to have a mismatched parenthesis.  So, the pattern:

<as

is not legal.  The implied starting parenthesis has no matching closing parenthesis.  This forces you to use these notations in pairs, which is the intended usage.
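
The table of equivalents can be turned into a small translation sketch.  Everything here is hypothetical: the function names expand_word_notation and find_words are made up, the translation is a plain text substitution based only on the table above, and Ascribe's real parser may well work differently (for example, this sketch would mangle standard constructs that themselves contain angle brackets).

```python
import re

def expand_word_notation(pattern):
    """Rewrite Ascribe angle-bracket notation as standard regex (sketch only)."""
    # Replace the two-character tokens first so '<<' is not read as '<' twice.
    pattern = pattern.replace("<<", r"(\b\w*")
    pattern = pattern.replace("<", r"(\b")
    pattern = pattern.replace(">>", r"\w*\b)")
    pattern = pattern.replace(">", r"\b)")
    return pattern

def find_words(pattern, text):
    """Return the full text of each case-insensitive match."""
    regex = re.compile(expand_word_notation(pattern), re.IGNORECASE)
    return [m.group(0) for m in regex.finditer(text)]

brands = "Bud, Bud light, Budweiser, Budwieser"

print(expand_word_notation("<staff>|<perso>>"))  # (\bstaff\b)|(\bperso\w*\b)
print(find_words("<bud>", brands))       # ['Bud', 'Bud']
print(find_words("<bud>>", brands))      # ['Bud', 'Bud', 'Budweiser', 'Budwieser']
print(find_words("<<ser>", brands))      # ['Budweiser', 'Budwieser']
print(find_words("<<w[ei]>>", brands))   # ['Budweiser', 'Budwieser']

# An unpaired bracket expands to an unbalanced parenthesis.
print(expand_word_notation("<as"))       # (\bas
```

Passing the last result to re.compile raises re.error, which mirrors the rule that the notations must be used in pairs.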

The angle brackets do not add any power to regular expressions; they just make your expressions easier to read.  The logical operators described next do add significant new power.

Logical operators

We have seen that it is easy to search for one word or another:

<staff>|<personnel>

matches either of these words.  What if we want to search for text that contains both the words “cancel” and “subscription”?  This is not supported in standard regular expressions, but it is supported in Ascribe.  We can write:

<cancel>&&<subscription>

This matches text that contains both words, in either order.  You can read && as “AND”.

We can also search for text that contains one word and does not contain another word.  This expression:

<bud>>&~<light>|<lite>

matches text that contains a word starting with “bud” but does not contain “lite” or “light”.  You can read &~ as “AND NOT”.

The && and &~ operators must be surrounded on either side by valid regular expressions, except that the &~ operator may begin the pattern.  We can rewrite the last example as:

&~<light>|<lite>&&<bud>>

which means exactly the same thing.
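
One way to picture how && and &~ behave is to split the pattern at the operators and test each piece separately.  The sketch below does that in Python.  It is an assumption about the behavior described above, not Ascribe's actual implementation, and the names expand_word_notation and text_matches are hypothetical.  It does not validate the pattern, so malformed input is not reported.

```python
import re

def expand_word_notation(pattern):
    """Rewrite Ascribe angle-bracket notation as standard regex (sketch only)."""
    pattern = pattern.replace("<<", r"(\b\w*")
    pattern = pattern.replace("<", r"(\b")
    pattern = pattern.replace(">>", r"\w*\b)")
    pattern = pattern.replace(">", r"\b)")
    return pattern

def text_matches(pattern, text):
    """True if text satisfies every && piece and no &~ piece of pattern."""
    # Splitting with a capturing group keeps the operators in the result:
    # [expr, op, expr, op, expr, ...]
    parts = re.split(r"(&&|&~)", pattern)
    expressions = parts[0::2]
    # The first expression has no operator in front of it; treat it as AND.
    operators = ["&&"] + parts[1::2]
    for op, expr in zip(operators, expressions):
        if expr == "":
            continue  # the pattern began with &~, so there is no first piece
        found = re.search(expand_word_notation(expr), text, re.IGNORECASE)
        if op == "&&" and not found:
            return False
        if op == "&~" and found:
            return False
    return True

print(text_matches("<cancel>&&<subscription>", "Please cancel my subscription"))  # True
print(text_matches("<cancel>&&<subscription>", "My subscription?  Cancel it."))   # True
print(text_matches("<cancel>&&<subscription>", "Please cancel"))                  # False
print(text_matches("<bud>>&~<light>|<lite>", "I like Budweiser"))                 # True
print(text_matches("<bud>>&~<light>|<lite>", "Bud Light, please"))                # False
print(text_matches("&~<light>|<lite>&&<bud>>", "Bud Light, please"))              # False
```

Because each piece is searched independently, the order of the words in the text does not matter.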

Pay attention to the fact that these logical operators require that the expressions on either side are valid on their own.  This pattern is illegal:

(<cancel>&&<subscription>)

because neither of the expressions

(<cancel>

<subscription>)

is legal when considered independently.  Each has a mismatched parenthesis.
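
The same failure can be seen in a standard regex engine.  In this sketch the two halves are written out in their standard equivalents from the table above, and Python's re module (a stand-in for Ascribe) is asked to compile each one.  The helper name is_valid_regex is hypothetical.

```python
import re

def is_valid_regex(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# "(<cancel>" expands to an extra opening parenthesis.
print(is_valid_regex(r"((\bcancel\b)"))         # False
# "<subscription>)" expands to an extra closing parenthesis.
print(is_valid_regex(r"(\bsubscription\b))"))   # False
# A properly paired expression compiles without error.
print(is_valid_regex(r"(\bcancel\b)"))          # True
```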

 

Pattern prefixes

There are two prefixes you can use that change the meaning of the match pattern that follows them.  A prefix must be the first character of the pattern.  The prefixes are:

Prefix    Meaning
*         Plain text matching
>         Standard regular expression matching, without the Ascribe extensions

These prefixes are useful when you are trying to match one of the special characters in regular expressions.  Suppose we really want to match “why?”.  Without a prefix we get:

why? ⇒ What are you doing and why?  Or why not?

But with the * prefix we can match the intended word only (with the question mark):

*why? ⇒ What are you doing and why?  Or why not?

The > prefix is useful if you are searching for an angle bracket or one of the Ascribe logical operators:

>\d+ < \d+ ⇒ The algebraic equation 22 < 25 is true.

It helps to know that \d matches any digit.
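
The effect of the two prefixes can be approximated in Python's re module, used here as a stand-in for Ascribe.  The correspondence is an assumption for illustration: re.escape plays the role of the * prefix by turning every special character into a literal one, and an ordinary pattern plays the role of the > prefix, since the stand-in engine has no Ascribe extensions to switch off.

```python
import re

text = "What are you doing and why?  Or why not?"

# No prefix: the '?' makes the 'y' optional, so "Wh" in "What" matches too.
no_prefix = re.findall(r"why?", text, flags=re.IGNORECASE)
print(no_prefix)        # ['Wh', 'why', 'why']

# Like the * prefix: treat the whole pattern as plain text.
plain_text = re.findall(re.escape("why?"), text, flags=re.IGNORECASE)
print(plain_text)       # ['why?']

# Like the > prefix: '<' is just a literal character in standard regex.
equation = re.findall(r"\d+ < \d+", "The algebraic equation 22 < 25 is true.")
print(equation)         # ['22 < 25']
```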

Summary

Regular expressions are a powerful feature of Ascribe.  Spending a little time getting used to them will pay off as you use Ascribe.  The Ascribe extensions to regular expressions make searching for words easier, and support logical AND and NOT operations.
