Figuring out what humans are saying in written language is a difficult task.  There is a huge amount of literature, and a great many software attempts to achieve this goal.  The bottom line is that we are still a long way off from having computers really understand human language.  Still, computers can do a pretty good job at what we are after: getting concepts and sentiment from text.

The term linguistic analysis covers a lot of territory.  We will use it in the narrow sense of a computer’s attempt to extract meaning from text.  Linguistic analysis is the theory behind what the computer is doing.  We say that the computer is performing Natural Language Processing (NLP) when it is doing an analysis based on the theory.  Linguistic analysis is the basis for Text Analytics.

Some steps in linguistic analysis are used in nearly all attempts to get computers to understand text.  It’s good to know some of these terms.

Here are some common steps, often performed in this order:

  • Sentence detection
    Here the computer tries to find the sentences in the text. Many linguistic analysis tools confine themselves to analysis of one sentence at a time, independent of the other sentences in the text.  This makes the problem more tractable for the computer but introduces problems.  Consider: “John was my service technician.  He did a super job.”  Considering the second sentence on its own, the computer may determine that there is strong positive sentiment around the job.  But if the computer considers only one sentence at a time, it will not figure out that it was John who did the super job.
  • Tokenization
    Here the computer breaks the sentence into words. There are many ways to do this, each with its own strengths and weaknesses.  The quality of the text matters a lot here.  Consider: “i really gotmad when the tech told me *your tires are flat*heck I knew that”.  Lots of problems here for the computer.  Humans see “gotmad” and know instantly that there should have been a space.  Computers are not very good at this.  Simple tokenizers simply take successive “word” characters and throw away everything else.  Here that would do an OK job with “flat*heck”, splitting it into “flat” and “heck”, but it would remove the information that *your tires are flat* is a quote and not really part of the surrounding sentence.  When the quality of the text is poor this type of thing can really confuse the computer.
  • Lemmatization and cleaning
    Most languages allow for multiple forms of the same word, particularly with verbs. So, in English, “was”, “is”, “are”, and “were” are all forms of the verb “to be”.  The lemma is the base form of a word.  The lemma for all these words is “be”.  There is a related technique called stemming, which tries to find the base part of a word, for example “ponies” becomes “poni”.  Lemmatization normally uses lookup tables, whereas stemming normally uses some algorithm to do things like discard possessives and plurals.  Lemmatization is usually preferred over stemming.
    Some linguistic analysis tools attempt to “clean up” the tokens.  The computer might try to correct common misspellings or convert emoticons to their corresponding words.
  • Part of speech tagging
    Once we have the tokens (words) we can try to figure out the part of speech for each of them, such as noun, verb, or adjective. Simple lookup tables let the computer get a start at this, but it is really a much more difficult job than that.  Many words in English can be both nouns and verbs (and other parts of speech).  To get this right the words cannot simply be considered one at a time.  Mistakes in part of speech tagging often lead to embarrassing mistakes by the computer.
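
The first three steps above can be sketched in a few lines of Python.  This is only an illustration under simple assumptions, not how any particular tool does it: the function names, the tiny LEMMAS table, and the suffix rules in the stemmer are all made up for this example.

```python
import re

# Hypothetical lookup table; real lemmatizers ship far larger dictionaries.
LEMMAS = {"was": "be", "is": "be", "are": "be", "were": "be"}

def split_sentences(text):
    """Naive sentence detection: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Simple tokenizer: keep runs of word characters, drop everything else."""
    return re.findall(r"\w+", sentence.lower())

def lemmatize(token):
    """Lookup-table lemmatization; unknown tokens pass through unchanged."""
    return LEMMAS.get(token, token)

def stem(token):
    """Crude suffix stripping (an illustration, not a published stemming algorithm)."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

print(split_sentences("John was my service technician.  He did a super job."))
# ['John was my service technician.', 'He did a super job.']
print(tokenize("*your tires are flat*heck I knew that"))
# ['your', 'tires', 'are', 'flat', 'heck', 'i', 'knew', 'that']
print([lemmatize(t) for t in ["was", "were", "job"]])
# ['be', 'be', 'job']
print(stem("ponies"), stem("was"))
# poni wa
```

Notice that the tokenizer silently drops the asterisks that marked the quote, and that the stemmer turns “was” into “wa” where the lookup table gives “be”.  A splitter this naive would also break on abbreviations such as “Dr.”, which is one reason real tools use more elaborate methods.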

Most linguistic analysis tools perform the above steps before tackling the job of figuring out what the tokenized sentences mean.  At this point the various approaches to linguistic analysis diverge.  We will describe in brief the three most common techniques.

Sentence parsing

Noam Chomsky is a key figure in linguistic theory.  He conceived the idea of a “universal grammar”, a way of constructing speech that is somehow understood by all humans and used in all cultures.  This leads to the idea that if you can figure out the rules, a computer could apply them and thereby understand human speech and text.  The sentence parsing approach to linguistic analysis has its roots in this idea.

A parser takes a sentence and turns it into something akin to the sentence diagrams you probably did in elementary school.  At the bottom of such a diagram are the tokens, and above them are classifications that group the tokens: V = verb, PP = prepositional phrase, S = sentence, and so on.

Once the sentence is parsed the computer can do things like give us all the noun phrases.  Sentence parsing does a good job of finding concepts in this way.  But parsers expect well-formed sentences to work on.  They do a poor job when the quality of the text is low.  They are also poor at sentiment analysis.
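
To see why a parsed sentence makes noun phrases easy to collect, here is a small sketch.  It assumes the parser’s output is available as nested tuples of the form (label, child, child, ...).  The example sentence, the tree, and the function names are invented for illustration and do not come from any real parser.

```python
# A parse tree as nested tuples: (label, child, ...); leaves are token strings.
# Hypothetical hand-built tree, not the output of a real parser.
TREE = (
    "S",
    ("NP", ("Det", "the"), ("N", "technician")),
    ("VP",
        ("V", "fixed"),
        ("NP", ("Det", "the"), ("Adj", "flat"), ("N", "tire"))),
)

def leaves(node):
    """Return the tokens under a node, left to right."""
    if isinstance(node, str):
        return [node]
    tokens = []
    for child in node[1:]:
        tokens.extend(leaves(child))
    return tokens

def noun_phrases(node):
    """Collect the text of every subtree labeled NP."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == "NP" else []
    for child in node[1:]:
        found.extend(noun_phrases(child))
    return found

print(noun_phrases(TREE))
# ['the technician', 'the flat tire']
```

The hard part is producing the tree in the first place.  Once it exists, pulling out the noun phrases is a simple walk over the structure.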

Bitext is an example of a commercial tool that uses sentence parsing.  Lower-level tools include Apache OpenNLP, Stanford CoreNLP, and GATE.

Rules-based analysis

Rules-based linguistic analysis takes a more pragmatic approach.  In a rules-based approach the focus is simply on getting the desired results without attempting to really understand the human language.  Rules-based analysis always focuses on a single objective, say concept extraction.  We write a set of rules that perform concept extraction and nothing else.  Contrast this with a parsing approach, where the parsed sentence may yield concepts (nouns and noun phrases) or entities (proper nouns) equally well.

Rules-based linguistic analysis usually has an accompanying computer language used to write the rules.  This may be augmented with the ability to use a general-purpose programming language for certain parts of the analysis.  The GATE platform, for example, lets you write custom rules in its JAPE pattern language, which is used by its ANNIE information extraction system, along with the Java programming language.

Rules-based analysis also uses lists of words called gazetteers.  These are lists of nouns, verbs, and so on.  A gazetteer also provides something akin to lemmatization.  Hence the verbs gazetteer may group all forms of the verb to be under the verb be.  But the gazetteer can take a more direct approach.  For sentiment analysis the gazetteer may have an entry for awful, with sub-entries horrible, terrible, nasty.  Therefore, the gazetteer can do both lemmatization and synonym grouping.
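
A gazetteer of this kind can be sketched as a simple mapping.  The entries below are just the examples from the paragraph above, and the names GAZETTEER, CANONICAL, and normalize are made up for this sketch.  Real gazetteers are much larger and usually live in their own files.

```python
# Hypothetical gazetteer: each main entry lists the forms grouped under it.
GAZETTEER = {
    "be": ["was", "is", "are", "were"],          # verb forms (like lemmatization)
    "awful": ["horrible", "terrible", "nasty"],  # synonyms for sentiment
}

# Invert it so any listed form can be looked up directly.
CANONICAL = {form: entry for entry, forms in GAZETTEER.items() for form in forms}
# Main entries map to themselves.
CANONICAL.update({entry: entry for entry in GAZETTEER})

def normalize(tokens):
    """Replace each token with its gazetteer entry; leave unknown tokens alone."""
    return [CANONICAL.get(token, token) for token in tokens]

print(normalize(["the", "service", "was", "terrible"]))
# ['the', 'service', 'be', 'awful']
```

The same lookup handles both jobs: “was” is reduced to its base verb, and “terrible” is folded into the sentiment entry “awful”.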

The text analytics engines offered by SAP are rules-based.  They make use of a rule language called CGUL (Custom Grouper User Language).  SAP says of this language (emphasis added):

Custom Grouper User Language (CGUL) is a sentence-based language that enables you to perform pattern matching using character or token-based regular expressions combined with linguistic attributes to define custom entity types. Working with CGUL can be very challenging.

Here is an example of what a rule in the CGUL language looks like:

#subgroup VerbClause: {
    ( %(Nouns)*%(NonBeVerbs)+)
    |([OD VB]%(NonBeVerbs)+|%(BeVerbs) [/OD])
    |([OD VB]%(BeVerbs)+|%(NonBeVerbs)+ [/OD])

  | ( [OD VB]%(NonBeVerbs)[/OD]   )

At its heart, CGUL uses regular expressions and gazetteers to form increasingly complex groupings of words.  The final output of the rules is the finished groups, for example concepts.

Many rules-based tools expect the user to become fluent in the rule language.  Giving the user access to the rule language empowers the user to create highly customized analyses, at the expense of training and rule authoring.
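
The general idea of combining gazetteers with token-based regular expressions can be sketched in Python.  To be clear, this is not CGUL and does not reproduce how SAP’s engine works.  The word lists, the one-letter class codes, and the single rule are all assumptions made for this example.

```python
import re

# Hypothetical gazetteers, each mapped to a one-letter class code.
GAZETTEERS = {
    "D": {"the", "a", "my"},                       # determiners
    "A": {"flat", "super", "new"},                 # adjectives
    "N": {"tire", "tires", "job", "technician"},   # nouns
    "V": {"did", "fixed", "replaced"},             # verbs
}

def tag(token):
    """Return the class code of the first gazetteer containing the token."""
    for code, words in GAZETTEERS.items():
        if token in words:
            return code
    return "O"  # anything not in a gazetteer

# Rule: optional determiner, any number of adjectives, one or more nouns.
CONCEPT_RULE = re.compile(r"D?A*N+")

def extract_concepts(tokens):
    """Match the rule against one code per token and return the matched words."""
    codes = "".join(tag(t) for t in tokens)
    return [" ".join(tokens[m.start():m.end()])
            for m in CONCEPT_RULE.finditer(codes)]

print(extract_concepts("the technician replaced my flat tires".split()))
# ['the technician', 'my flat tires']
```

Because each token becomes exactly one character, a match position in the code string is also a position in the token list.  Every word the rule should recognize has to be in a gazetteer, which is where much of the rule authoring effort goes.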

Deep learning and neural networks

The third approach we will discuss is machine learning.  The basic idea of machine learning is to give the computer a bunch of examples of what you want it to do, and let it figure out the rules for how to do it.  This basic idea has been around for a long time and has gone through several evolutions.  The current hot topic is neural networks.  This approach to machine learning is based loosely on the way our brains work.  IBM has been giving this a lot of publicity with its Watson technology.  You will recall that Watson beat the best human players of the game of Jeopardy.  We can get insight into machine learning techniques from this example.

The idea of deep learning is to build neural networks in layers, each working on progressively broader sections of the problem.  Deep learning is another buzzword that is often applied outside of the area intended by linguistic researchers.
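
To make “layers” a little more concrete, here is a toy two-layer network.  It is not a language model, and its weights are set by hand rather than learned.  It only shows how a second layer can combine the outputs of the first to answer a question that neither first-layer unit can answer alone (here, whether exactly one of two inputs is on).  All names and numbers are chosen for this sketch.

```python
def step(x):
    """Fire (1) if the weighted input is positive, otherwise stay off (0)."""
    return 1 if x > 0 else 0

def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of all inputs, then applies step."""
    return [step(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
            for unit_weights, bias in zip(weights, biases)]

def exactly_one(a, b):
    # Layer 1: the first unit detects "a or b", the second detects "not both".
    hidden = layer([a, b], weights=[[1, 1], [-1, -1]], biases=[-0.5, 1.5])
    # Layer 2: fires only when both layer-1 units fired.
    return layer(hidden, weights=[[1, 1]], biases=[-1.5])[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, exactly_one(a, b))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```

In a real network there are many more units and layers, and the weights are found automatically from examples instead of being written down by a person.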

We won’t try to dig into the details of these techniques, but instead focus on the fundamental requirement they have.  To work, machine learning needs examples.  Lots of examples.  One area in which machine learning has excelled is image recognition.  You may have used a camera that can find the faces in the picture you are taking.  It’s not hard to see how machine learning could do this.  Give the computer many thousands of pictures and tell it where the faces are.  It can then figure out the rules to find faces.  This works really well.

Back to Watson.  It did a great job at Jeopardy.  Can you see why?  The game is set up perfectly for machine learning.  First, the computer is given an answer.  The computer’s job is to give back the correct question (in Jeopardy you are given the answer and must respond with the correct question).  Since Jeopardy has been played for many years, the computer has just what it needs to work with: a ton of examples, all set up just the way needed by the computer.

Now, what if we want to use deep learning to perform sentiment analysis?  Where are we going to get the examples?  It’s not so easy.  People have tried to build data sets to help machines learn things like sentiment, but the results to date have been disappointing.  The Stanford CoreNLP project has a sentiment analysis tool that uses machine learning, but it is not well regarded.  Machine learning today can deliver great results for concept extraction, but less impressive results for sentiment analysis.
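
The dependence on examples is easy to see in a small sketch.  The code below is not deep learning.  It is a plain bag-of-words naive Bayes classifier, swapped in because it fits in a few lines, and the four labeled sentences are invented for this example.

```python
import math
from collections import Counter

# Hypothetical hand-labeled training examples.
EXAMPLES = [
    ("he did a super job", "pos"),
    ("great service and friendly staff", "pos"),
    ("the technician was awful", "neg"),
    ("terrible service and a nasty attitude", "neg"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, model):
    """Pick the label with the highest naive Bayes score."""
    word_counts, label_counts = model
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(EXAMPLES)
print(classify("super friendly technician", model))  # pos
print(classify("awful attitude", model))             # neg
```

Everything this classifier knows comes from the four sentences it was given.  A word it has never seen, such as “dreadful”, gives it nothing to go on, so good results depend on having a large and well-labeled set of examples.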


Linguistic analysis is a complex and rapidly developing science.  Several approaches to linguistic analysis have been developed, each with its own strengths and weaknesses.  To obtain the best results you should choose the approach that gives superior performance for the type of analysis you need.  For example, you may choose a machine learning approach to identify topics, a rules-based approach for sentiment analysis, and a sentence parsing approach to identify parts of speech and their interrelationships.