The How's and Wow's of Natural Language Processing

By Maria Hussain on 9th March, 2021
Have you ever wondered how your smartphone can recognize so many different accents of the English language? What about how you can type in Urdu or Arabic words without using their alphabets and Google Translate still knows what you’re talking about? Have you seen those videos of toddlers screaming at Alexa to play the Baby Shark song?
All of these are real-life examples of natural language processing working its magic.
So, what exactly is natural language processing?
Natural Language Processing, or NLP, sits at the intersection of computer science, artificial intelligence and human language. It is the technology used to comprehend and utilize textual data without needing to change the natural way that data occurs in human communication.
Keep in mind that, by commonly cited estimates, up to 90% of the data we generate and possess is in unstructured form. This means that things like biodata forms, complaint registries, feedback surveys, letters, emails, customer inquiries and their solutions are not neatly arranged in a spreadsheet according to the subject they talk about. Moreover, they are written by ordinary people in their normal way of speaking or writing, and with particular positive or negative connotations. All of this makes the language very difficult to understand for anyone who does not speak it, let alone a machine. Even if it were understood and we attempted to arrange hundreds of billions of words into orderly rows and columns, it is hard to imagine that we would ever finish.
Natural Language Processing does this in minutes.
Instead of extensively reading and analyzing every word of the data provided, it works by focusing on special patterns and keywords. Let us now learn how.
Six basic steps for NLP
Tokenization
The very first thing an NLP system does is split a sentence, a paragraph, or even an entire text document into smaller pieces or ‘tokens’. A token is usually a single word, but it may also be made up of several words that always appear together. For the machine to start reading any language, it must first understand where one meaningful word ends and another begins.
For instance, in the sentence, “I made fried eggs and toast for breakfast today”, possible tokens can be ‘I’, ‘made’, ‘fried’, ‘eggs’, ‘and’, ‘toast’, ‘for’, ‘breakfast’, ‘today’.
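As a rough illustration, the example sentence can be tokenized with a single regular expression. This is only a minimal sketch: the function name tokenize is made up for this article, and real tokenizers handle many more cases (hyphens, URLs, multi-word tokens, languages without spaces).

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation (hypothetical helper)."""
    # \w+ matches a run of letters or digits; the optional group keeps
    # contractions such as "don't" together as one token.
    return re.findall(r"\w+(?:'\w+)?", text)

sentence = "I made fried eggs and toast for breakfast today"
tokens = tokenize(sentence)
print(tokens)
```

Each item in the resulting list is one token, ready to be passed on to the next steps.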
Stemming
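In stemming, the system cuts each token down to a root form, or stem, by stripping common suffixes (and, in some stemmers, prefixes) according to fixed rules. In the aforementioned sentence, ‘eggs’ would become ‘egg’, while ‘fried’ may be cut down to ‘fri’, which is not a real word. An irregular form like ‘made’ is usually left as it is, because no suffix rule connects it to ‘make’. Although stemming is a rapid process, it may not always be accurate to the context or meaning of a word’s usage. To make it clearer, words like ‘affect’, ‘affecting’, ‘affection’, and ‘affectionate’ may all be reduced to ‘affect’, even though they do not all mean the same thing.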
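To see why stemming is fast but blunt, here is a toy suffix-stripping stemmer. It is a sketch written for this article, not the algorithm any particular library uses: the suffix list and the three-letter minimum are assumptions chosen only to reproduce the ‘affect’ example.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ionate", "ing", "ion", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["affect", "affecting", "affection", "affectionate", "running"]:
    print(w, "->", toy_stem(w))
```

All four ‘affect’ words collapse to the same stem regardless of meaning, and ‘running’ comes out as ‘runn’, which shows how a purely mechanical rule can produce something that is not a word at all.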
Lemmatization
Similar to stemming, lemmatization also reduces a word to its root (in this case, called a lemma), but it differs in the precision it provides. In stemming, the root might not even be a word of the language, but a lemma is always a proper word that makes sense in the context. On the other hand, lemmatization is not as rapid, because it has to look words up in a vocabulary and often needs to know their part of speech.
Both of them are commonly used, often in combination. Stemming on its own suits situations where the exact meaning of the word matters less, e.g. a spam detection system, whereas lemmatization would be preferred where meaning and context matter most, like in analyzing a questionnaire form.
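The difference can be sketched with a tiny dictionary-based lemmatizer. The lookup table below is invented for illustration and covers only a handful of words; real lemmatizers rely on large vocabularies and morphological analysis.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
LEMMAS = {
    ("made", "VERB"): "make",
    ("fried", "VERB"): "fry",
    ("fried", "ADJ"): "fried",   # as in "fried eggs", the adjective stays as is
    ("eggs", "NOUN"): "egg",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos):
    """Return the dictionary lemma, or the lowercased word if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("made", "VERB"))
print(toy_lemmatize("fried", "VERB"))
print(toy_lemmatize("fried", "ADJ"))
```

Because the lookup takes the part of speech into account, the same surface word can map to different lemmas depending on how it is used, which is the kind of context a plain stemmer ignores.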
POS Tagging
Another step in NLP is POS tagging, or part-of-speech tagging, in which the natural language processing system assigns a part of speech to each word in the text, according to its meaning and contextual use. This helps the computer interpret language better, especially in cases where a word can act as more than one part of speech. Consider these two sentences:
• He is trying to court her.
• I will see you at court!
In the first sentence, ‘court’ will be labeled as a verb, and in the second, it will be labeled as a noun. POS tagging establishes the difference using hand-written rules, probability-based techniques, or both.
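A rule-based tagger for the two ‘court’ sentences might look like the sketch below. The lexicon, the tag names and the single contextual rule are all assumptions made for this example; the rule in particular is far too crude for real text (think of "to school").

```python
# Hypothetical mini-lexicon with a default tag for each known word.
LEXICON = {
    "he": "PRON", "i": "PRON", "you": "PRON", "her": "PRON",
    "is": "AUX", "will": "AUX", "trying": "VERB", "see": "VERB",
    "to": "PART", "at": "ADP", "court": "NOUN",
}

def toy_pos_tag(tokens):
    """Tag each token from the lexicon, then apply one contextual rule."""
    tagged = []
    for i, token in enumerate(tokens):
        tag = LEXICON.get(token.lower(), "NOUN")  # unknown words default to NOUN
        # Contextual rule: a word directly after "to" is read as a verb.
        if i > 0 and tokens[i - 1].lower() == "to":
            tag = "VERB"
        tagged.append((token, tag))
    return tagged

sentence_1 = "He is trying to court her".split()
sentence_2 = "I will see you at court".split()
print(toy_pos_tag(sentence_1))
print(toy_pos_tag(sentence_2))
```

The same word receives a different tag in each sentence purely because of its neighbours, which is the basic idea behind contextual tagging.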
Named Entity Recognition
Named Entity Recognition, or NER, is the process of locating and appropriately categorizing the parts of a sentence that are names. The categories can be names of people, places, businesses, organizations, movies and TV shows, as well as percentages, quantities, expressions of time and space, monetary values, and so on. An example sentence to demonstrate this feature is, “I had a Starbucks coffee during my transit at Heathrow.” Here, ‘Starbucks’ and ‘Heathrow’ will be recognized as named entities and categorized as a business name and a place name, respectively.
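The simplest possible version of this idea is a lookup against a list of known names, sometimes called a gazetteer. The sketch below assumes a two-entry list and made-up label names; production NER systems usually learn from annotated examples instead of relying on a fixed list.

```python
import re

# Hypothetical gazetteer: known names and their categories.
GAZETTEER = {"starbucks": "BUSINESS", "heathrow": "PLACE"}

def toy_ner(text):
    """Return (name, category) pairs for capitalized words found in the gazetteer."""
    entities = []
    for token in re.findall(r"\w+", text):
        label = GAZETTEER.get(token.lower())
        # Requiring a capital letter avoids matching ordinary lowercase words.
        if label and token[0].isupper():
            entities.append((token, label))
    return entities

print(toy_ner("I had a Starbucks coffee during my transit at Heathrow."))
```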
Chunking
Chunking is the last important step of NLP, and works on top of the POS tags and NER labels. Basically, it is the process of (you guessed it) ‘chunking’ together tokens that relate to each other in a sentence or piece of text. Some common examples of chunks include noun phrases, verb phrases, adjective phrases, etc.
To explain it even further, the words “a yellow flower” may be chunked together to make a noun phrase, and in the sentence “I was running very fast”, the words “was running very fast” may be grouped into a verb phrase.
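A noun-phrase chunker can be sketched as a pattern over POS tags: an optional determiner, any number of adjectives, then one or more nouns. The tag names and the example sentence below are assumptions for illustration, and the tags are supplied by hand rather than produced by a real tagger.

```python
import re

# Map the tags we care about to single letters; everything else becomes "x".
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

def chunk_noun_phrases(tagged):
    """Group tokens whose tags match DET? ADJ* NOUN+ into noun phrases."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"D?A*N+", codes):
        # One letter per token, so match positions are token positions.
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

tagged = [("She", "PRON"), ("picked", "VERB"), ("a", "DET"),
          ("yellow", "ADJ"), ("flower", "NOUN"), ("today", "ADV")]
print(chunk_noun_phrases(tagged))
```

Encoding each tag as one character lets an ordinary regular expression do the grouping, which keeps the sketch short. Verb phrases could be handled in the same way with a different pattern.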
Okay, but how does natural language processing generate text?
Now that we have covered the basics of how NLP comprehends unstructured text, we can go on to understand how it can generate it. Having learned the patterns of a language in its normal forms and tones, such as which words tend to follow which, an artificially intelligent computer can communicate using that language in a very human manner. It does so by reproducing those learned patterns to produce text or, when paired with text-to-speech, spoken results.
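One very simple way to picture this is a bigram model, which remembers which word followed which in the text it has seen and then strings words together by repeatedly picking a plausible next word. This is a deliberately small stand-in for illustration, not how modern text generators are built; the corpus, function names and seed are all assumptions.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Record, for every word, the words that followed it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, max_words=8, seed=0):
    """Generate up to max_words words by sampling a learned next word each step."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    output = [start]
    while len(output) < max_words:
        options = model.get(output[-1])
        if not options:  # the last word was never followed by anything
            break
        output.append(rng.choice(options))
    return " ".join(output)

corpus = "i made eggs for breakfast and i made toast for lunch"
model = build_bigram_model(corpus)
print(generate(model, "i"))
```

Every pair of neighbouring words in the output was seen in the training text, so the result imitates the corpus without simply copying it.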
Applications of natural language processing
The following are some common areas where NLP is used in the everyday world:
• Sentiment Analysis: Also called opinion mining, this is a form of data mining that uses NLP systems to work out the inclinations and emotions of users from the text they provide.
• Chatbot: Chatbots are often found on business or service websites, where they help the user navigate better and answer questions asked in natural human language.
• Speech Recognition: NLP-driven systems can understand and execute voice commands in multiple accents and dialects of a language.
• Machine Translation: This is the feature of NLP that is used in translating one language into another.
• Spell Checking: Spelling and grammar checking is one of the most common uses of natural language processing. With its ability to skim through text at a rapid pace, it can easily point out mistakes in text, according to its preprogrammed dictionary and rulebook.
• Keyword Searching: Websites and social media applications utilize NLP to effectively search for keywords in any number of text inputs.
• Information Extraction: A computer equipped with NLP can help the user find the information they need in large amounts of text, for example in quality control, cause and effect analysis, and statistics.
• Advertisement Matching: By reading the search terms of a user, or looking through their browsing history, the NLP system is able to inform companies about which advertisements would be the best match for that particular user.
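To make one of these applications concrete, here is a small sketch of spell checking against a word list, using Python's standard difflib module to find close matches. The six-word dictionary and the 0.7 similarity cutoff are assumptions for illustration; a real spell checker would use a full vocabulary and take context into account.

```python
import difflib

# Hypothetical mini-dictionary of correctly spelled words.
DICTIONARY = ["breakfast", "eggs", "toast", "language", "natural", "processing"]

def suggest(word, dictionary=DICTIONARY):
    """Return up to three close dictionary matches for a misspelled word."""
    word = word.lower()
    if word in dictionary:
        return []  # spelled correctly, nothing to suggest
    return difflib.get_close_matches(word, dictionary, n=3, cutoff=0.7)

for w in ["toast", "langauge", "brekfast"]:
    print(w, "->", suggest(w))
```

A correctly spelled word returns an empty list, while a misspelled one returns the dictionary words that look most similar to it.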
To conclude, Natural Language Processing is one of the most impressive and useful technologies of our world today, with the ability to combine human and computer knowledge and create groundbreaking advancements in how we navigate our lives online and offline.