Stemming and Lemmatization NLTK

In this blog, I will try to explain what stemming and lemmatization are in NLTK (Python), with a few examples.

Stemming and lemmatization are text normalization (or word normalization) techniques in the field of Natural Language Processing. They are used to prepare text, words, and documents for further processing. People often search for them under the names "nltk lemmatizer" and "nltk stemming".

But before we proceed, it is important to clear up a doubt that often arises in beginners' minds: what exactly are NLTK and NLP? The two terms are sometimes used interchangeably, but NLP is the field, while NLTK is one of the main tools used for it. NLTK contains stemmers, lemmatizers, tokenizers, and other algorithms that help turn human language into something a machine can process more easily.

Natural Language Toolkit (NLTK) is a Python package used for Natural Language Processing (NLP).

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. The stem need not be identical to the root of the word; it is usually enough that related words map to the same stem, even if that stem is not itself a valid root.

Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Stemming Algorithms

NLTK provides several stemming algorithms in Python, and its stemmers cover a number of languages:

English: Porter Stemmer, Lancaster Stemmer (also known as the Paice-Husk Stemmer)
Other languages: Snowball Stemmers (Dutch, Italian, German, French, Russian, and others, as well as English), ISRI Stemmer (Arabic), RSLP Stemmer (Portuguese)
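As a rough sketch of how the Snowball stemmers are selected by language, the snippet below lists the supported languages and stems a few words. The example words are my own choice, and it assumes NLTK is already installed.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by SnowballStemmer.
print(SnowballStemmer.languages)

# English is supported as well, not only non-English languages.
english = SnowballStemmer("english")
print(english.stem("running"))     # run
print(english.stem("generously"))  # generous

# The same class handles other languages when given their name.
german = SnowballStemmer("german")
print(german.stem("Autobahnen"))
```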

Porter Stemmer

PorterStemmer is one of the oldest and most widely used stemmers; Martin Porter published the algorithm in 1980. It is known for its simplicity and speed. It is commonly used in Information Retrieval (IR) environments, where related word forms have to be matched quickly when answering search queries.

Make sure that you have NLTK installed before trying the stemmer.
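Here is a minimal sketch of the Porter stemmer in use, assuming NLTK has been installed (for example with pip install nltk). The word list is only an illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative words, not taken from any particular corpus.
words = ["cats", "running", "connection", "connected", "connecting", "troubling"]

# Map each word to its Porter stem.
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:12} -> {stem}")
```

All the "connect" variants collapse to the same stem, while "troubling" becomes "troubl", which shows that a stem does not have to be a real word.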

Let us now see how Lancaster Stemmer works.

Lancaster Stemmer

The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm with rules saved externally. With each iteration, the algorithm tries to find the rule by the last character of the word. Each rule specifies either a deletion or replacement of an ending. If there is no such rule, it terminates.

The disadvantage of the Lancaster Stemmer is over-stemming: because the rules are applied iteratively, words can be cut down heavily. Over-stemming may produce stems that have no meaning or are not linguistic units at all.

The catch here is that neither stemmer takes the meaning of a word into account; both simply apply their suffix rules. The Porter Stemmer is the more conservative of the two, while the Lancaster Stemmer is more aggressive and tends to produce shorter stems.
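To see the difference in practice, a small side-by-side comparison can help. This is only a sketch with hand-picked words; the exact stems depend on the rule sets shipped with your NLTK version.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hand-picked illustrative words.
words = ["maximum", "saying", "cement", "presumably"]

# Collect (word, porter stem, lancaster stem) for each word.
comparison = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

print(f"{'word':12} {'porter':12} {'lancaster':12}")
for word, porter_stem, lancaster_stem in comparison:
    print(f"{word:12} {porter_stem:12} {lancaster_stem:12}")
```

For example, the Lancaster Stemmer cuts "cement" down to "cem" and "maximum" down to "maxim".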

Sentence Stemming

Let us understand how sentence stemming works and how we can stem any sentence.

When I pass a whole sentence to the stemmer, I get essentially the same sentence back, because the stemmer treats its input as a single word. What else needs to be done? We need to stem each word in the sentence and return the combined sentence. To separate the sentence into words, we use a tokenizer. We then create a function, pass the sentence to it, and it gives back the stemmed sentence.
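A minimal sketch of such a function is shown below. The name stem_sentence is my own, and I use wordpunct_tokenize here because it is purely regex-based and needs no extra data; nltk.word_tokenize is the more common choice but requires the punkt tokenizer data to be downloaded first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()

def stem_sentence(sentence):
    """Tokenize the sentence, stem each token, and join the stems with spaces."""
    tokens = wordpunct_tokenize(sentence)
    return " ".join(porter.stem(token) for token in tokens)

# Illustrative sentence.
print(stem_sentence("The cats kept running towards the connected houses."))
```

Note that joining with spaces also puts a space before punctuation, which is usually fine for further processing but not for display.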

NLTK Lemmatization

Lemmatization in NLP also reduces a word to its root, just like stemming, and the two are closely related. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

However, stemmers are typically easier to implement and run faster. The difference lies in retaining the meaning of the word: a lemmatizer takes the context of a word (such as its part of speech) into account, so the meaning is not lost. It therefore links words with similar meaning to one word. In lemmatization the root word is called the lemma; a lemma is the canonical form, dictionary form, or citation form of a set of words.

Ways to Lemmatize

There are different ways to lemmatize the text, sentences, or documents.

WordNet
WordNet (with POS tag)
TextBlob
TextBlob (with POS tag)
spaCy
TreeTagger
Pattern
Gensim
Stanford CoreNLP

In this blog, let us understand how the WordNet lemmatizer works with a POS tag.

Example

We provide the context in which we want to lemmatize, that is, the part of speech (POS). This is done by passing a value for the pos parameter of wordnet_lemmatizer.lemmatize.

Note that v denotes a verb, a an adjective, and n a noun (the default).

Applications of Stemming and Lemmatization

Text Mining is the process of analyzing texts written in natural language and extracting high-quality information from them. It looks for interesting patterns in the text or extracts data from it. Text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Developers have to prepare the text using methods such as lexical analysis, POS (part-of-speech) tagging, stemming, and other Natural Language Processing techniques to derive useful information from it.
Sentiment Analysis is the analysis of people's reviews and comments. It is widely used to analyze products in online retail shops. Stemming and lemmatization are used as part of the text-preparation process before the text is analyzed.
Document Clustering is the application of cluster analysis to documents. The document is tokenized, stop words are removed, and stemming and lemmatization are performed to reduce the number of distinct tokens, which speeds up the whole process. After this pre-processing, features are computed from the frequency of each token, and then clustering methods are applied. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. Examples of document clustering include web document clustering for search engines.
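The pre-processing described for document clustering can be sketched roughly as follows. The stop-word set is a tiny hypothetical one chosen for the example (NLTK ships a fuller list in its stopwords corpus), and term_frequencies is an illustrative helper name, not an NLTK function.

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "are", "and", "to", "of"}

porter = PorterStemmer()

def term_frequencies(document):
    """Tokenize, drop punctuation and stop words, stem, and count the stems."""
    tokens = [t.lower() for t in wordpunct_tokenize(document) if t.isalpha()]
    stems = [porter.stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

doc = "The cats are running. A cat connected the connections and is connecting."
freqs = term_frequencies(doc)
print(freqs.most_common())
```

The resulting counts are the kind of token-frequency features that a clustering method could then work on.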

What to use: NLTK Stemmers or NLTK Lemmatizers

Whether to use a stemmer or a lemmatizer depends on the application you are working on. If speed is the focus, use a stemmer. If the meaning of the words is important, as in any application where the meaning of language is crucial or knowledge of context is required, go for a lemmatizer.
