Quickly Organize Your Unstructured Text Using Natural Language Classifier

Andrew R. Freed
7 min read · Jun 7, 2018
Photo by Maarten van den Heuvel on Unsplash

Introduction
Classification technology allows you to process large volumes of information quickly — this is one of its key benefits. Instead of humans spending hours sorting through text looking for relevant or action-worthy items, classification algorithms can surface interesting nuggets from unstructured text in seconds or minutes. In this post I’ll explore various ways you can apply classification to your problem domain, with a focus on Watson Natural Language Classifier.

What is classification?
Wikipedia defines classification as “a process related to categorization, the process in which ideas and objects are recognized, differentiated, and understood”. You can think about classification as “given an input, which output class does it most resemble?”

Classification example with “The Thinker” from https://commons.wikimedia.org/wiki/File:The_Thinker,_Auguste_Rodin.jpg.

There are several different ways you might want to classify your textual information. Here are just a few:

  • Topic: What topic is a piece of text about? Is this a legal text or a fictional story? You can train Watson Natural Language Classifier to answer these questions.
  • Emotion and/or sentiment: How does the speaker feel? Are they positive, negative, or neutral towards a subject, and happy, sad, or angry? Watson Tone Analyzer is pre-trained to help answer these questions.
  • Intent: What task is the speaker trying to accomplish? This is an underpinning of conversational systems like Watson Assistant.

As you can see above, text classification can work on a variety of input and output types. Classification systems are generally trained by first defining two or more target categories and then providing examples for each of those categories. We interchangeably refer to these categories as ‘classes’. The Watson products and services catalog includes several classification tools, including pre-built classifiers like Tone Analyzer and train-your-own classifiers like Natural Language Classifier. In this blog post we will explore classification techniques with a focus on Natural Language Classifier.

Machine learning classification techniques

Watson Natural Language Classifier offers a “no code” way to implement classification using machine learning. As mentioned earlier, you need only provide the classes (categories) and examples, and a black box can handle the rest. You may have experience with personal email spam filters. Every time you take an email and “mark as spam”, you are giving additional training data to a classifier with at least two categories (“spam” and “not spam”).

Machine learning classifiers work well over a wide range of input data. They work best when classifications have distinct boundaries and a large number of ways to be expressed. Email is a great example — there are probably trillions of emails that could be described as spam or not-spam, clearly distinct classifications.

Training an ML classifier requires a “representative set” of training data. If we can provide an accurate sample of data that looks like the broader set, we can train on that smaller sample, giving us much quicker results. Some email spam/not-spam training systems have you get started with just 20 spam and not-spam messages — a far cry from the thousands of emails that might be sitting in your inbox right now, or the trillions that exist world-wide. Enterprise-grade classifiers are trained on subsets as well, just larger subsets. Using an accurate, representative subset is important because machine learning models learn only from examples, so your training set needs to supply all the kinds of examples you want the model to learn.
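To make the training format concrete, here is a minimal sketch of creating a classifier from such a representative CSV of “example text,class” rows, assuming a recent ibm-watson Python SDK. The API key, service URL, file name, and example rows are placeholders, not the actual demo data.

```
import json

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageClassifierV1

# Placeholder credentials and URL; substitute your own service instance.
nlc = NaturalLanguageClassifierV1(authenticator=IAMAuthenticator('YOUR_API_KEY'))
nlc.set_service_url('https://api.us-south.natural-language-classifier.watson.cloud.ibm.com')

# weather_training.csv holds representative rows of "example text,class", e.g.:
#   How warm will it be today?,temperature
#   Will it rain this weekend?,conditions
with open('weather_training.csv', 'rb') as training_data:
    classifier = nlc.create_classifier(
        training_metadata=json.dumps({'name': 'weather', 'language': 'en'}),
        training_data=training_data,
    ).get_result()

# Training runs asynchronously; poll get_classifier() until status is 'Available'.
print(classifier['classifier_id'], classifier['status'])
```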

Machine learning classification demo with Natural Language Classifier

Without doing any coding of your own, you can try out a pre-trained ML classifier provided by Natural Language Classifier in its demo service. The demo service takes weather-related questions and classifies them as being related to temperature or conditions. Be sure to test out the service with your own questions.

Results for “How warm will it get today?”
Results for “Is it going to sleet this afternoon?”

The classifier was not trained on these questions and had never seen an example with ‘sleet’, yet it was able to correctly classify both questions. Imagine how long it would take if you were asked to classify thousands of temperature or condition questions by hand. Automating that classification with Natural Language Classifier would save you a lot of work! Automated classifiers shine at very quickly classifying data into categories, freeing you as an expert to focus on higher level tasks.
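If you want to go beyond the browser demo, the sketch below shows roughly how you might send the same ‘sleet’ question to a trained classifier with the ibm-watson Python SDK and inspect the returned confidences. The credentials and classifier ID are placeholders.

```
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageClassifierV1

# Placeholder credentials, URL, and classifier ID.
nlc = NaturalLanguageClassifierV1(authenticator=IAMAuthenticator('YOUR_API_KEY'))
nlc.set_service_url('https://api.us-south.natural-language-classifier.watson.cloud.ibm.com')

result = nlc.classify('YOUR_CLASSIFIER_ID',
                      'Is it going to sleet this afternoon?').get_result()

print(result['top_class'])              # the best-matching class
for c in result['classes']:             # every class with its confidence score
    print(c['class_name'], round(c['confidence'], 3))
```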

You can see more demos of classification at the Natural Language Classifier sample applications page.

Improving a machine learning classifier

As stated earlier, many machine learning classifiers like Natural Language Classifier use a “no code” model. A machine learning model is trained with a set of inputs and the target classes they map to. Improving the model is an exercise in improving the inputs and/or the classes.

Let’s go back to the demo model from Natural Language Classifier. It has been trained on a series of weather-related questions. In the model’s “mind” there are only two types of input possible in the entire universe: temperature or condition questions. This means that any input you send it can only be classified into those two classes. Try sending it different input: “What is the definition of precipitation?” or “What is the meaning of life?”. The results may surprise you, but remember that the classifier believes every input must belong to one of its two classes, so all it can tell you is how much more likely the selected class is than the other.

You can improve the performance of this classifier in two ways. The first is by defining additional target classes and providing new training data for those classes. In the example above you could provide a target of “meaning” or even “off-topic”. Be sure that your new training set is still representative of the universe of input you expect to receive. Do not fabricate new training data; take it directly from user input in logs or other sources. This helps ensure representativeness — no matter how clever you are, any training data you fabricate is not going to match the way users actually interact with your system.
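As a rough illustration, adding such a class is mostly a matter of appending labeled rows to your training CSV and training a new classifier from it. The class name and rows below are invented for the example; in practice you would pull them from real user logs, as the paragraph above advises.

```
# Illustrative only: append a few "off-topic" rows to the training CSV.
extra_rows = (
    "What is the definition of precipitation?,off-topic\n"
    "What is the meaning of life?,off-topic\n"
    "Tell me a joke,off-topic\n"
)

with open('weather_training.csv', 'a') as f:
    f.write(extra_rows)

# A trained classifier is not updated in place; create (train) a new classifier
# from the expanded CSV, as in the earlier training sketch, to pick up the class.
```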

Second, you can improve performance by influencing the input that is sent to the classifier. You can provide users with sample inputs, or use auto-complete to nudge their incomplete input towards inputs your classifier handles well. You can even chain classifiers together to route each request to the classifier most likely to handle it; this is called ensemble modeling, and we will discuss it more later.

The benefit of a machine learning classifier is that it is “no code” and you can improve it just by working with the inputs and outputs, letting the machine learning model figure out the rest.

Rules-based classification techniques

Classification does not always require machine learning — sometimes a simple rule is enough to get the job done. Rules-based classification is important when you have clear rules that must be applied and you don’t want to leave anything to chance. For example:

  • The first three characters of a product code may tell you the manufacturer.
  • A swear word filter has a limited set of words that must always be flagged.
  • A form always contains certain information at precise coordinates (ex: US tax form 1040).
  • Document metadata can give you the author’s name rather than inferring it from the text body.

Rules-based classifiers work well when the number of significant input variations is finite. In each example above a definite rule can be identified quickly without a need for a machine learning model to infer it.
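As a rough sketch of what such rules can look like in code, the snippet below implements two of the examples above in plain Python. The manufacturer prefixes and word list are made up for illustration.

```
# Two of the rules above as plain functions; prefixes and word list are invented.
MANUFACTURERS = {'ACM': 'Acme Corp', 'GLB': 'Globex', 'INI': 'Initech'}
FLAGGED_WORDS = {'darn', 'heck'}  # stand-in for a real blocked-word list


def classify_manufacturer(product_code: str) -> str:
    """The first three characters of a product code identify the manufacturer."""
    return MANUFACTURERS.get(product_code[:3].upper(), 'unknown')


def contains_flagged_word(text: str) -> bool:
    """Flag text containing any word from the fixed list, regardless of context."""
    words = (w.strip('.,!?') for w in text.lower().split())
    return any(w in FLAGGED_WORDS for w in words)


print(classify_manufacturer('ACM-4471'))        # Acme Corp
print(contains_flagged_word('Well, heck no!'))  # True
```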

Ensemble models

Ensemble modeling is a great way to get the benefits of multiple models. In classification we can use ensemble classifiers to get the best out of both machine learning and rules-based techniques. By running multiple models sequentially or in parallel we can get superior results.

  • When classifying document types, we use the machine learning model unless the first nine words are “Form 1040 Department of the Treasury — Internal Revenue Service”
  • When classifying email, we use the machine learning classifier first. Then we use a rule to check whether the sender is on a known whitelist or blacklist (sketched below).
  • When classifying text for obscenity, we let the machine learning analysis run first and then double-check against our rules list of known swear words.

We can even run multiple classifiers in parallel, “competing” against one another, and then use a resolution mechanism such as voting to decide which result we want to use.
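Here is a minimal sketch of the sequential email ensemble described above, with a stand-in function where a real machine learning classifier call would go. The sender lists and class names are illustrative only.

```
KNOWN_GOOD_SENDERS = {'billing@mycompany.example'}
KNOWN_BAD_SENDERS = {'prizes@lottery.example'}


def classify_with_ml(subject_and_body: str) -> str:
    # Stand-in for a real machine learning call (for example, a Natural Language
    # Classifier request on the subject and body); here it always answers 'not-spam'.
    return 'not-spam'


def classify_email(sender: str, subject_and_body: str) -> str:
    ml_class = classify_with_ml(subject_and_body)
    if sender in KNOWN_BAD_SENDERS:    # a certain rule overrides the model
        return 'spam'
    if sender in KNOWN_GOOD_SENDERS:
        return 'not-spam'
    return ml_class                    # otherwise trust the machine learning result
```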

As alluded to in the examples above, one important reason to use ensembles is to take advantage of additional context. In the email classification mechanism we use a machine learning model on the email subject + body and then use rules on the sender’s address. When classifying text you may be able to get better insight by including aspects of the user’s profile. For instance, when classifying my Twitter activity you might start with the default assumption that each tweet is about technology, since my profile tells you I tweet about technology. When accuracy is paramount in your classification, use all the context that is available to you.

Conclusion

Text classification is intended to help you organize unstructured text. Natural Language Classifier allows you to do this classification quickly and with no coding required. For the best classification results you need to understand your inputs and whether they have a finite or infinite number of significant differences. Use machine learning when you want to train by example, and use rules-based classifiers when you have finite, clearly defined rules available. You can combine multiple classification techniques in an ensemble to get the best results, particularly when there is additional context that cannot be directly fed into a classifier.


Andrew R. Freed

Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.