Brand disambiguation using the Ascribe Coder API

Asking respondents to recall brand names is a common survey technique.  Often this is done by asking the respondent to pick brands from a list.  Such aided brand awareness questions can introduce bias.  We can remove this bias by asking the respondent to type the brand name without providing a list.  These unaided brand awareness questions can provide superior results, at the cost of having to disambiguate the often misspelled brand names.

In this post I will describe a technique to disambiguate brand mentions automatically.  We will use a taxonomy to map brand misspellings to the correct brand name, then automatically code the responses using a codebook constructed from the taxonomy.  This technique makes use of the Ascribe Coder API, which stores and automatically codes the brand mentions and returns the disambiguated brands to the API client.  The API client can therefore use these corrected brand mentions for branching logic in the survey.

Summary of the Technique

The Ascribe Coder API can be used to automatically code responses.  But coupled with Rule Sets it can also help automate curating brand mentions, improving the ease and accuracy of auto-coding.

The approach we will take involves these key components:

  • Manually build a taxonomy that groups brand misspellings under the correct brand names.
  • Construct a Rule Set that maps incorrectly spelled brand mentions to the correct brand name using the taxonomy.
  • Using the Ascribe Coder API:
    • Set up the study and questions in Ascribe.
    • Use the taxonomy to populate the codebook of the unaided brand awareness question.
    • While the survey is in the field, send new responses for the unaided brand awareness question to the API for storage, brand disambiguation, and automated coding.

Brand Disambiguation

Imagine you have an unaided brand awareness question: “What brand of beer did you most recently drink?”  Let’s suppose one of the brands you want to identify is Bud Light.  You will get responses like:

  • Bud Light
  • Bud Lite
  • Budweiser light
  • Budwiser lite
  • Bud lt
  • Budlite

And many more.  The creativity of respondents knows no bounds!  Ideally you would like not only to code these responses correctly as “Bud Light”, but also to curate them so that they are all transcribed to “Bud Light”.  How can we make this happen?

Building a Brand Taxonomy

Observe that the list of variations of Bud Light above can be thought of as a set of synonyms for Bud Light.  We can make a taxonomy that maps each of these to Bud Light.  Bud Light is the group, and the variations are the synonyms in that group.  We can do this similarly for all the brands we are considering.

Obtaining the correct brand list

Wikipedia provides this list of brands and sub-brands for Budweiser:

  • Budweiser
  • Bud Light
  • Bud Light Platinum
  • Bud Light Apple
  • Bud Light Lime
  • Bud Light Lime-A-Ritas
  • Budweiser Select
  • Budweiser Select 55
  • Budweiser 66
  • Budweiser 1933 Repeal Reserve
  • Bud Ice
  • Bud Extra
  • Budweiser/Bud Light Chelada
  • Budweiser Prohibition Brew
  • Budweiser NA

These become the groups in our taxonomy.  We need to build synonyms for each group to capture the expected misspellings of the brand.

Creating the Synonyms

When building a taxonomy, it is good practice to start with the more specific brand names and progress to the less specific.  I will demonstrate with the first three brands in the list above.  Start with the most specific brand, “Bud Light Platinum”.  We can construct a synonym to match this brand with these rules:

  • The mention should contain three words
  • The first word must start with “bud” (case insensitive)
  • The second word must start with either “lig” or “lit”
  • The third word must start with “plat”
  • The words must be separated with at least one whitespace character

Let’s build a regular expression that conforms to these rules.  Here is the portion of the regular expression that will match a word starting with “bud”:

bud\w*

The \w* matches zero or more word characters.  The match pattern for the third word is constructed similarly.

To match the second word, we want to match a word that starts with “lig” or “lit”:

li[gt]\w*

The [gt] matches either the character “g” or “t”.  Putting these three word patterns together gives:

bud\w* li[gt]\w* plat\w*

This will match any mention that has a word containing “bud” followed by a single space, then a word that starts with “lig” or “lit”, followed by a single space, then a word that starts with “plat”.  This is not exactly what we want, because the pattern may match anywhere within the mention.  This pattern will match:

  • Bud Light Platinum
  • Bud light platnum
  • Budweiser lite platinum

But it will also match

  • Redbud light platinum Stella Artois

To ensure that we match only mentions that consist entirely of the synonym pattern, we need to surround the regular expression with the start of string and end of string operators:

^bud\w* li[gt]\w* plat\w*$

Finally, we should tolerate any number of whitespace characters between the words.  The expression \s+ will match one or more whitespace characters.  Hence our finished synonym is:

^bud\w*\s+li[gt]\w*\s+plat\w*$
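
As a quick sanity check, the finished synonym can be exercised outside Ascribe.  The sketch below uses Python's re module and assumes case-insensitive matching, as stated in the rules above; the regular expression engine used by Ascribe may differ in details.  The function name is illustrative only.

```python
import re

# Finished synonym for "Bud Light Platinum" from the text above.
# re.IGNORECASE is an assumption that mirrors the "case insensitive" rule.
SYNONYM = re.compile(r"^bud\w*\s+li[gt]\w*\s+plat\w*$", re.IGNORECASE)


def matches_synonym(mention: str) -> bool:
    """Return True when the whole mention matches the synonym pattern."""
    return SYNONYM.search(mention) is not None


samples = [
    "Bud Light Platinum",
    "Bud light platnum",
    "Budweiser   lite platinum",            # several spaces between words
    "Redbud light platinum Stella Artois",  # rejected by the ^ and $ anchors
    "budlite platinum",                     # no whitespace after "bud"
]
for mention in samples:
    print(f"{mention!r}: {matches_synonym(mention)}")
```

The first three mentions match.  The last two do not: one is rejected by the anchors, and the other has no whitespace between “bud” and “lite”.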

Using Multiple Synonyms in a Group

We may well want to map the response “budlite platinum” to our group, but the synonym we created above will not do that.  There is no space between “bud” and “lite”.  We can fix this in one of two ways.  First, we can try to make the synonym we created above also match this response.  Second, we can make a new synonym to handle this case.  For this simple example it is not hard to make the existing synonym also match this response, but in general it is better to add a new synonym rather than trying to make a single “one size fits all” synonym.  The regular expression patterns are already hard enough to read without making them more complicated!  Here is a second synonym that will do the job:

^budli[gt]\w*\s+plat\w*$

Using the Group Process Order

We may well want to make our taxonomy match “Budweiser” if a response starts with “bud”, but only if it does not match any of the more specific groups.  Groups with a lower Process Order are checked for matches before groups with a higher Process Order.  Groups with the same Process Order value are checked in indeterminate order, so it is important to design the synonyms for such groups so that no two synonyms in different groups match the same string.

We can create a Budweiser group to match any single word that starts with “bud” by giving it a single regular expression synonym with this pattern:

^bud\w*$

Assuming the other groups in the taxonomy have the default Process Order of 100, we can assign a Process Order to this group with any value greater than 100.  This group will now match single words that start with “bud” only if no other group matches the string.
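
To see why the Process Order matters, here is a minimal local simulation of the matching order.  It is a sketch of the behavior described above, not the Ascribe implementation.  The Group class, the find_group function, and the “Bud Light” synonym are hypothetical and exist only for illustration; case-insensitive matching is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    name: str
    synonyms: List[str]
    process_order: int = 100  # default Process Order mentioned in the text


# Hypothetical in-memory taxonomy. The "Bud Light" synonym is illustrative only.
TAXONOMY = [
    Group("Budweiser", [r"^bud\w*$"], process_order=200),
    Group("Bud Light", [r"^budli[gt]\w*$"]),
    Group("Bud Light Platinum",
          [r"^bud\w*\s+li[gt]\w*\s+plat\w*$", r"^budli[gt]\w*\s+plat\w*$"]),
]


def find_group(mention: str, taxonomy: List[Group]) -> Optional[str]:
    """Return the first matching group, checking lower Process Order first."""
    for group in sorted(taxonomy, key=lambda g: g.process_order):
        for pattern in group.synonyms:
            if re.search(pattern, mention, re.IGNORECASE):
                return group.name
    return None


for mention in ["budlite", "budwiser", "bud lite platnum", "Stella Artois"]:
    print(f"{mention!r} -> {find_group(mention, TAXONOMY)}")
```

Both the Budweiser and the illustrative Bud Light synonyms match “budlite”.  Because the Budweiser group has the higher Process Order, it is consulted last and the mention maps to Bud Light.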

Creating the Rule Set

You create Rule Sets in Ascribe as a manual operation.  We want to create a Rule Set that the Ascribe Coder API can use to disambiguate our brand mentions.  Given our brand taxonomy, its job is to map the response provided by the respondent to the corrected brand name.  Fortunately, the Rule Set does not depend on any particular taxonomy, so a single Rule Set can serve every brand list taxonomy.

Create a new Rule Set in Ascribe.  It initially contains no rules.  We want to populate it with a Modify Response on Load rule that maps the brand typed by the respondent to the correct beer brand using our new taxonomy.

The rule looks like this:

// Replace response with group
if (f.taxonomy) {
  var group = f.taxonomy.Group(f.r);
  if (group) {
    f.r = group;
  }
}

This rule says: if a taxonomy is passed to the rule, map the response to a group using the taxonomy.  If there is a resulting group, replace the response with the group.  The result is that the misspelled brand mention is replaced with the correctly spelled brand name.

Using the Ascribe Coder API

Armed with this taxonomy and Rule Set we have the hard part done.  Now we need to make use of it to automatically code responses.  The Ascribe Coder API supports the use of Rule Sets, which in turn allow access to a taxonomy.

Setting up the Study from Survey Metadata

If you wish, you can use the Ascribe Coder API to create the study and questions in Ascribe from the survey metadata, as described in this post.  Alternatively, you can create the study and questions in Ascribe using the Ascribe web site.

Query for Resources by ID

When we created the taxonomy and Rule Set in Ascribe, we gave each of them an ID.  Via the API we can query the Taxonomies and RuleSets resources to find the key for each.

For example, we can query for the taxonomy list with a GET to

https://webservices.goascribe.com/coder/Taxonomies

The JSON response has this form:

{
  "taxonomies": [
    …
    {
      "key": 190,
      "id": "Beers",
      "description": "Beer brand disambiguation",
      "countGroups": 35,
      "countSynonyms": 47
    },
    …
  ],
  "errors": null
}

If we know that we named our taxonomy “Beers”, we now know that its key is 190.  While the ID of the taxonomy may be changed by the user, the key will never change.  It is therefore safe to store this key away for any future use of the taxonomy.

Keys can be found for other objects from their ID in a similar fashion by querying the appropriate resource of the API.  In this manner we can find the keys for our Rule Set, and for any study and question in that study given their respective IDs.
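
The lookup from ID to key can be sketched as a small helper that scans the parsed JSON.  This is a minimal illustration, assuming the response shape shown above.  The function name find_key_by_id is hypothetical, and the name of the list property for other resources is an assumption to check against the API help pages.  The HTTP request itself is omitted, since authentication is not covered in this post.

```python
import json
from typing import Optional


def find_key_by_id(payload: dict, collection: str, wanted_id: str) -> Optional[int]:
    """Return the key of the item whose "id" equals wanted_id, or None."""
    for item in payload.get(collection) or []:
        if item.get("id") == wanted_id:
            return item.get("key")
    return None


# Sample payload shaped like the Taxonomies response shown above.
sample = json.loads("""
{
  "taxonomies": [
    {"key": 190, "id": "Beers", "description": "Beer brand disambiguation",
     "countGroups": 35, "countSynonyms": 47}
  ],
  "errors": null
}
""")

print(find_key_by_id(sample, "taxonomies", "Beers"))
```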

Creating the Codebook from the Taxonomy

We are using our taxonomy and Rule Set to disambiguate brands as responses are loaded.  If we create the Codebook properly we can automatically code these corrected responses as they are loaded.  As a bonus, we can use the taxonomy to create the codebook automatically.

Once we have our taxonomy key, we can query the Taxonomies resource for the groups and synonyms of the taxonomy.  The response body has this form:

{
  "groups": [
    {
      "key": 22265,
      "name": "Bud Light Platinum",
      "synonyms": [
        {
          "key": 84539,
          "text": "^bud\\w*\\s+li[gt]\\w*\\s+plat\\w*$",
          "isRegEx": true
        },
        {
          "key": 84540,
          "text": "^budli[gt]\\w*\\s+plat\\w*$",
          "isRegEx": true
        }
      ]
    },
    {
      "key": 225,
      "name": "Budweiser",
      "synonyms": [
        {
          "key": 8290,
          "text": "^bud\\w*$",
          "isRegEx": true
        }
      ]
    }
  ],
  "key": 190,
  "id": "Beers",
  "username": "cbaylis",
  "description": "Beer brand disambiguation",
  "errors": null
}

Note that the group names are what we want as codes in our codebook.  These are the correctly spelled brands.  Now, to automatically code the corrected responses, all we need to do is provide a regular expression for each code in the codebook with the correct brand name, surrounded by the start and end of string operators, for example ^Budweiser$.

We POST to the Codebooks resource to create the codebook.  The request body has this form:

{
  "codebook": [
    {
      "description": "Bud Light Platinum",
      "regexPattern": "^Bud Light Platinum$"
    },
    {
      "description": "Budweiser",
      "regexPattern": "^Budweiser$"
    }
  ]
}

We have created our codebook from the taxonomy and have prepared it for regular expression coding of the correctly spelled brand names.

As a defensive programming note, you should escape any regular expression operators that may appear in a brand name before placing it in the pattern.  These include characters such as ., $, *, +, ?, [ and ].
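
The codebook request body can be generated from the taxonomy groups along these lines.  This is a sketch under stated assumptions: the helper names are hypothetical, and the set of escaped characters is a conservative guess that should be adjusted to the regular expression dialect Ascribe uses.  A hand-rolled escaper is used instead of Python's re.escape because re.escape also escapes spaces and hyphens, which would make the patterns differ from the form shown above.

```python
import json
import re

# Conservative set of regex metacharacters to escape (an assumption).
_METACHARS = re.compile(r"([.^$*+?()\[\]{}|\\])")


def escape_brand(name: str) -> str:
    """Backslash-escape regex metacharacters in a brand name."""
    return _METACHARS.sub(r"\\\1", name)


def build_codebook_body(taxonomy: dict) -> dict:
    """Build a Codebooks request body with one exact-match code per group."""
    return {
        "codebook": [
            {
                "description": group["name"],
                "regexPattern": "^" + escape_brand(group["name"]) + "$",
            }
            for group in taxonomy["groups"]
        ]
    }


# Only the group names are needed; synonyms are ignored here.
taxonomy = {"groups": [{"name": "Bud Light Platinum"}, {"name": "Budweiser"}]}
print(json.dumps(build_codebook_body(taxonomy), indent=2))
```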

Loading and Automatically Coding Responses

We now have all the tools in place to load and automatically code responses.  We can do this after data collection is completed, or in real time while the survey is in field.

We can put the responses into an Ascribe study with a POST to the Responses resource of the API, as described here: https://webservices.goascribe.com/coder/Help/Api/POST-Responses-QuestionKey.  In the body of the POST we send the responses, and specify our Rule Set and taxonomy, and that we want to automatically code responses using regular expression matching to the codes in the codebook.  The body of the POST has this form:

{
  "responses": [
    {
      "rid": "100",
      "verbatim": "budwiser",
      "transcription": "budwiser"
    },
    {
      "rid": "101",
      "verbatim": "bud lite platnum",
      "transcription": "bud lite platnum"
    }
  ],
  "autoCodeByRegex": true,
  "ruleSetKey": 6,
  "taxonomyKey": 190
}

Note that we provide the text of the response for both the verbatim and transcription of each response.  The combination of our taxonomy and Rule Set will change the verbatim to the corrected brand name.  Because the transcription still holds the original response, the original text remains visible in Ascribe Coder.
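
A survey integration might assemble this request body as follows.  This is a minimal sketch: the function name build_responses_body is hypothetical, the keys 6 and 190 are the example values used in this post, and sending the body to the Responses resource is left out.

```python
import json


def build_responses_body(raw_responses, rule_set_key, taxonomy_key):
    """Build the POST body, copying each raw text into both text fields."""
    return {
        "responses": [
            # The verbatim will be corrected by the Rule Set; the
            # transcription keeps the text as typed by the respondent.
            {"rid": str(rid), "verbatim": text, "transcription": text}
            for rid, text in raw_responses
        ],
        "autoCodeByRegex": True,
        "ruleSetKey": rule_set_key,
        "taxonomyKey": taxonomy_key,
    }


body = build_responses_body(
    [(100, "budwiser"), (101, "bud lite platnum")],
    rule_set_key=6,
    taxonomy_key=190,
)
print(json.dumps(body, indent=2))
```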

The response body has this form:

{
  "codebookKey": 1540228,
  "responsesAdded": 2,
  "existingResponsesUnchanged": 0,
  "responsesModifiedByRuleSet": 2,
  "responsesVetoedByRuleSet": 0,
  "responsesCodedByTextMatch": 0,
  "responsesCodedByInputId": 0,
  "responsesCodedByRegex": 2,
  "addedResponses": [
    {
      "rid": "100",
      "codes": [
        {
          "codeKey": 915872,
          "description": "Budweiser"
        }
      ]
    },
    {
      "rid": "101",
      "codes": [
        {
          "codeKey": 915873,
          "description": "Bud Light Platinum"
        }
      ]
    }
  ],
  "errors": null
}

The rid values in the response correspond to those in the request.  We see that we have mapped the misspelled brand names to their correct spellings, and automatically applied the code for those corrected brand names.  The codeKey values identify the corresponding codes in our codebook.

If you are using the Ascribe Coder API directly from the survey logic, the codeKey and/or description can be used for branching logic in the survey.
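
As an illustration of such branching, the sketch below extracts the corrected brands per respondent from a response body of the form shown above.  The function name codes_by_rid and the question identifiers are hypothetical; how the result feeds into survey routing depends on your survey platform.

```python
from typing import Dict, List


def codes_by_rid(api_response: dict) -> Dict[str, List[str]]:
    """Map each respondent ID to the descriptions of the codes applied."""
    return {
        item["rid"]: [code["description"] for code in item.get("codes") or []]
        for item in api_response.get("addedResponses") or []
    }


# Trimmed sample shaped like the response body shown above.
sample = {
    "addedResponses": [
        {"rid": "100", "codes": [{"codeKey": 915872, "description": "Budweiser"}]},
        {"rid": "101", "codes": [{"codeKey": 915873, "description": "Bud Light Platinum"}]},
    ],
    "errors": None,
}

brands = codes_by_rid(sample)

# Hypothetical branch: route respondent 101 to a brand-specific follow-up.
if "Bud Light Platinum" in brands.get("101", []):
    next_question = "Q_PLATINUM_FOLLOWUP"
else:
    next_question = "Q_GENERIC"
print(brands, next_question)
```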

Summary

We have seen how to use the Ascribe Coder API in conjunction with a taxonomy to correct brand name misspellings and automatically code the responses.  While we have limited our attention to brand names, this technique is applicable whenever short survey responses in a confined domain need to be corrected and coded.
