You can create and use taxonomies in Ascribe Rule Sets to help with common text manipulation chores, such as cleaning comments.

In this article I will explain how to use a taxonomy in a Rule Set.  We will use the taxonomy to correct common misspellings of words that are important to the text analytics work we are doing.  If you are new to Rule Sets you can learn about them in Introduction to Ascribe Rule Sets.

A taxonomy contains a list of groups, each of which has a list of synonyms.  The taxonomy can be used to find words or phrases and map them to the group.  In our case each group will be a properly spelled word, and its synonyms will be the misspelled versions of that word.

Rule Sets do not have direct access to the taxonomies stored on the Taxonomies page in Ascribe.  A Rule Set can, however, create and use its own taxonomies.  These exist only while the Rule Set is running, and are not stored on the Taxonomies page.

Setting up a Taxonomy

You create a Taxonomy object in a rule like so:

var taxonomy = new Taxonomy();

You can now add groups and synonyms to the taxonomy.  To add a Group to the taxonomy we write:

var group = taxonomy.AddGroup("technical");

The variable group holds the new group, the text of which is “technical”.  We can now add regular expression synonyms to the group for common misspellings of “technical”.

group.AddRegex("tecnical");
group.AddRegex("technecal");

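We can improve this by adding the \b operator at the start and end of our regular expression patterns, to make sure that they match only whole words.  Because each pattern is written as a JavaScript string, the backslash must be doubled; a single \b inside a string literal is a backspace character, not a word boundary:

group.AddRegex("\\btecnical\\b");
group.AddRegex("\\btechnecal\\b");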
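
To see what the word-boundary operator buys us, here is a small sketch using the standard JavaScript RegExp object rather than the Ascribe Taxonomy object.  The sample comments are made up for illustration.  It also shows why the backslash is doubled when the pattern is written as a string.

```javascript
// Plain JavaScript RegExp, not the Ascribe Taxonomy object.
// Inside a string literal "\b" is a single backspace character,
// so the word-boundary operator is written as "\\b".
var backspaceLength = "\b".length;    // 1: one backspace character
var boundaryLength = "\\b".length;    // 2: a backslash followed by "b"

var loose = new RegExp("tecnical");
var whole = new RegExp("\\btecnical\\b");

// Hypothetical comment containing the misspelling inside a longer word
var comment = "The pyrotecnical display was great";

var looseHit = loose.test(comment);   // true: matches inside "pyrotecnical"
var wholeHit = whole.test(comment);   // false: not a whole word here
var wordHit = whole.test("I called tecnical support");   // true

console.log(looseHit, wholeHit, wordHit);
```

Without the boundaries the pattern would also fire inside longer words, which is rarely what we want when correcting spelling.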

One of the nice things about using the Taxonomy object is that you can use the extensions to regular expressions provided by Ascribe.  The \b operator makes the regular expression patterns hard to read.  We can replace these with the < and > operators provided by Ascribe for start and end of word:

group.AddRegex("<tecnical>");
group.AddRegex("<technecal>");

And, if you are well up on your regular expressions, you realize that we could do this with a single regular expression pattern:

group.AddRegex("<tecnical>|<technecal>");

but we will continue with this simple example using two synonyms for purposes of this discussion.
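
If it helps to see how the < and > operators relate to standard regular expressions, the sketch below rewrites them as \b.  The helper toStandardPattern is hypothetical and is not part of Ascribe; it only approximates the operators, and it would mangle patterns that use < or > for other purposes, such as lookbehind or named groups.

```javascript
// Hypothetical helper, not part of Ascribe: approximates the < and > word
// operators by rewriting each one as the standard \b word-boundary operator.
function toStandardPattern(ascribePattern) {
  return ascribePattern.replace(/[<>]/g, "\\b");
}

var pattern = toStandardPattern("<tecnical>|<technecal>");
console.log(pattern);   // \btecnical\b|\btechnecal\b

var regex = new RegExp(pattern);
console.log(regex.test("a technecal question"));   // true
console.log(regex.test("pyrotecnical"));           // false
```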

Because this pattern of “add group, then add synonyms to the group” is so common, the Taxonomy object supports a more compact syntax to add a group with any number of synonyms:

taxonomy.AddGroupAndSynonyms("technical", "<tecnical>", "<technecal>");

Here the first parameter is the group text, which may be followed by any number of regular expression patterns for the synonyms.
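
To make the relationship between the two forms concrete, here is a stand-in written in plain JavaScript.  SketchTaxonomy and its method names are hypothetical and exist only for this illustration; this is not how Ascribe implements the Taxonomy object.  It simply models a taxonomy as a list of groups, each holding a list of synonym patterns.

```javascript
// Hypothetical stand-in for illustration only; this is not Ascribe's Taxonomy.
// It models a taxonomy as a list of groups, each holding synonym patterns.
function SketchTaxonomy() {
  this.groups = [];
}

// Adds an empty group and returns it so synonyms can be added afterwards.
SketchTaxonomy.prototype.addGroup = function (text) {
  var group = { text: text, patterns: [] };
  this.groups.push(group);
  return group;
};

// Compact form: the first argument is the group text, the rest are patterns.
SketchTaxonomy.prototype.addGroupAndSynonyms = function (text) {
  var group = this.addGroup(text);
  for (var i = 1; i < arguments.length; i++) {
    group.patterns.push(arguments[i]);
  }
  return group;
};

// Long form: add the group, then add each synonym to it.
var longForm = new SketchTaxonomy();
var sketchGroup = longForm.addGroup("technical");
sketchGroup.patterns.push("<tecnical>");
sketchGroup.patterns.push("<technecal>");

// Compact form: one call does the same work.
var compactForm = new SketchTaxonomy();
compactForm.addGroupAndSynonyms("technical", "<tecnical>", "<technecal>");

var sameStructure =
  JSON.stringify(longForm.groups) === JSON.stringify(compactForm.groups);
console.log(sameStructure);   // true
```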

Using a Taxonomy

We can ask whether any synonym in the taxonomy matches a given piece of text using the IsMatch method of the Taxonomy object.  For example, we may set up a taxonomy with synonyms for words or phrases we want to call out as alerts.  One way to use this would be to inject findings into the text analytics results when they contain sensitive words.  Here is an Add Finding from Finding rule that demonstrates this:

var taxonomy = new Taxonomy();
// Add a list of sensitive words
// We don’t care about the group text, so we use an empty string
taxonomy.AddGroupAndSynonyms("", "<sue>", "<lawyer>", "<suspend>", "<cancel>");

if (taxonomy.IsMatch(f.t) || taxonomy.IsMatch(f.e)) {
  f.t = "Alert!";
  return f;
}

Here we add a new finding with the topic “Alert!” if the topic or expression contains one of the sensitive words we are looking for.

We can also use this technique to remove comments that contain nonsense character sequences at load time.  Here is a Modify Response on Load rule that will remove comments containing nonsense character sequences commonly typed by respondents using QWERTY keyboards:

var taxonomy = new Taxonomy();
taxonomy.AddGroupAndSynonyms("", "asdf", "qwer", "jkl;");
if (taxonomy.IsMatch(f.r)) {
  return null; // If nonsense is found, the comment is not loaded
}
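
The idea behind this kind of check can be sketched in plain JavaScript.  The function matchesAny is a hypothetical stand-in, not Ascribe's IsMatch; it assumes that a taxonomy matches when any one of its synonym patterns matches.  The sketch is case-sensitive, which may or may not be how the Taxonomy object behaves.

```javascript
// Hypothetical stand-in for the kind of check IsMatch performs.
// Assumption: the text matches when any one synonym pattern matches.
function matchesAny(patterns, text) {
  for (var i = 0; i < patterns.length; i++) {
    if (new RegExp(patterns[i]).test(text)) {
      return true;
    }
  }
  return false;
}

var nonsense = ["asdf", "qwer", "jkl;"];

console.log(matchesAny(nonsense, "asdfasdf"));                 // true
console.log(matchesAny(nonsense, "The agent was helpful."));   // false
```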

Returning to our original desire to correct common misspellings when data are loaded, we can use the powerful Replace method of the Taxonomy object.  Given a string of text, the Replace method will replace any text that matches a synonym in the taxonomy with the group text.

Here is an example Modify Response on Load rule:

var taxonomy = new Taxonomy();
taxonomy.AddGroupAndSynonyms("technical", "tech?n[ia]ca?l");
taxonomy.AddGroupAndSynonyms("support", "suport");

f.r = taxonomy.Replace(f.r);

This rule will change the comment:

I called tecnacl suport.

to:

I called technical support.
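
For readers who want to trace the substitution by hand, here is a rough equivalent in plain JavaScript.  The function replaceWithGroupText and the groups array are hypothetical stand-ins for this illustration; the sketch does not claim to reproduce how Replace is implemented in Ascribe.

```javascript
// Hypothetical stand-in for the behaviour of Replace, in plain JavaScript.
var groups = [
  { text: "technical", patterns: ["tech?n[ia]ca?l"] },
  { text: "support", patterns: ["suport"] }
];

function replaceWithGroupText(groupList, text) {
  var result = text;
  groupList.forEach(function (group) {
    group.patterns.forEach(function (pattern) {
      // The "g" flag replaces every occurrence, not just the first.
      result = result.replace(new RegExp(pattern, "g"), group.text);
    });
  });
  return result;
}

var fixed = replaceWithGroupText(groups, "I called tecnacl suport.");
console.log(fixed);   // I called technical support.
```

Note that the pattern for "technical" also matches the correctly spelled word, in which case the word is simply replaced with itself.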

Performance Considerations

The examples in the last section create and populate a new Taxonomy object each time the rule is invoked.  With small taxonomies such as those in the examples above this is just fine.  But if the taxonomy is large, rebuilding it for every response can slow down your Rule Set dramatically.  A much better approach when using large taxonomies is to create a Class rule to hold the taxonomy, so that it is built once and reused.  This can provide much better performance, with the additional benefit of making the taxonomy available in all rules in the Rule Set.  See my post Using Class Rules in an Ascribe Rule Set for more information.
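
The general "build once, reuse many times" idea can be sketched in plain JavaScript.  This does not show Class rule syntax, which is covered in the post mentioned above; the names getCorrections, correct, and buildCount are hypothetical and used only to show that the setup work happens a single time.

```javascript
// Hypothetical sketch of building the patterns once and reusing them.
// This is plain JavaScript, not an Ascribe Class rule.
var buildCount = 0;
var cachedCorrections = null;

function getCorrections() {
  if (cachedCorrections === null) {
    buildCount++;   // counts how often the expensive setup runs
    cachedCorrections = [
      { text: "technical", regex: /tech?n[ia]ca?l/g },
      { text: "support", regex: /suport/g }
    ];
  }
  return cachedCorrections;
}

function correct(comment) {
  return getCorrections().reduce(function (text, correction) {
    return text.replace(correction.regex, correction.text);
  }, comment);
}

var comments = [
  "I called tecnacl suport.",
  "The tecnical team gave great suport.",
  "No issues."
];
var corrected = comments.map(function (comment) {
  return correct(comment);
});

console.log(corrected);
console.log(buildCount);   // 1: the patterns were built only once
```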

Summary

Using a Taxonomy object in a Rule Set can simplify common tasks such as detection or replacement of certain words or phrases.  Taxonomies have the added benefit of allowing the use of an extended regular expression syntax for word matching.