Let's resume our discussion of the NLP pipeline. In the previous blog post we completed the first two steps of the NLP pipeline: data acquisition and text cleaning. In this blog post we will cover data pre-processing and feature engineering.

PRE-PROCESSING:

To pre-process your text simply means to bring your text into a form that is predictable and analysable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with TF-IDF (approach) from tweets (domain) is a task.

                  Task = approach + domain 


One task’s ideal pre-processing can become another task’s worst nightmare. So take note: text pre-processing is not directly transferable from task to task.

Let’s take a very simple example: say you are trying to discover commonly used words in a news dataset. If your pre-processing step involves removing stopwords just because some other task used that step, then you are probably going to miss out on some of the common words because you have ALREADY eliminated them. So really, it’s not a one-size-fits-all approach.

Here are some common pre-processing steps used in NLP software:

  • Preliminaries:
    Sentence segmentation and word tokenization.
  • Frequent steps:
    Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.
  • Advanced processing:
    POS tagging, parsing, coreference resolution, etc.

Preliminaries :

While not all steps will be followed in all the NLP pipelines we encounter, the first two are more or less seen everywhere. Let’s take a look at what each of these steps mean.

An NLP system analyses text by first breaking it into sentences (sentence segmentation) and then further into words (word tokenization).

SENTENCE SEGMENTATION :

We could simply divide the text into sentences based on the position of the full stop (.). But what happens if the text contains something like "Dr. Joy" or an ellipsis (....)?

NLP libraries such as NLTK help us overcome these issues.

import nltk
nltk.download('punkt')  # sentence tokenizer models used by sent_tokenize
from nltk.tokenize import sent_tokenize

text = "It's fun to study NLP. Would recommend to all."
print(sent_tokenize(text))
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
["It's fun to study NLP.", 'Would recommend to all.']

WORD TOKENIZATION :

To tokenize a sentence into words, we can start with a simple rule to split text into words based on the presence of punctuation marks. The NLTK library allows us to do that.

from nltk.tokenize import word_tokenize
text = "It's fun to study NLP. Would recommend to all."
print(word_tokenize(text))
['It', "'s", 'fun', 'to', 'study', 'NLP', '.', 'Would', 'recommend', 'to', 'all', '.']

Frequent Steps :

Some frequent pre-processing steps are listed below; a small sketch combining a few of them follows the list:

  • Lower casing
  • Removal of Punctuations
  • Removal of Stopwords
  • Removal of Frequent words
  • Stemming
  • Lemmatization
  • Removal of emojis
  • Removal of emoticons
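Here is a minimal sketch, not from the original post, showing how a few of these steps (lowercasing, punctuation removal and stopword removal) could be combined with NLTK. The example sentence is made up for illustration.

import string
import nltk
nltk.download('punkt')       # tokenizer models
nltk.download('stopwords')   # English stopword list
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "NLP is Fun, and pre-processing is an IMPORTANT first step!"

tokens = word_tokenize(text.lower())                          # lowercasing + tokenization
tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]           # stopword removal
print(tokens)
# expected roughly: ['nlp', 'fun', 'pre-processing', 'important', 'first', 'step']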

These steps are frequent, but they may vary from problem to problem. For example, suppose we want to predict whether a given text belongs to news, music or some other field. A news article might contain the word "news" many times, so to categorize the text correctly we cannot simply remove frequent words.

Now you might be wondering what lemmatization and stemming are, so before we move forward let's understand these terms.

Stemming and lemmatization help us obtain the root forms (sometimes treated as synonyms in a search context) of inflected (derived) words.

Stemming :

Stemming is faster because it chops words without knowing the context of the words in given sentences.

  • It is a rule-based approach.
  • Its accuracy is lower.
  • When converting a word to its root form, stemming may produce a word that does not exist in the dictionary.
  • Stemming is preferred when the exact meaning of the word is not important for the analysis, e.g. in spam detection.
  • For example: Studies => Studi
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

word_data = "Da Vinci Code is such an amazing book to read.The book is full of suspense and Thriller. One of the best work of Dan Brown."
# First, word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the root of each word
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))
Actual: Da  Stem: Da
Actual: Vinci  Stem: vinci
Actual: Code  Stem: code
Actual: is  Stem: is
Actual: such  Stem: such
Actual: an  Stem: an
Actual: amazing  Stem: amaz
Actual: book  Stem: book
Actual: to  Stem: to
Actual: read  Stem: read
Actual: .  Stem: .
Actual: .The  Stem: .the
Actual: book  Stem: book
Actual: is  Stem: is
Actual: full  Stem: full
Actual: of  Stem: of
Actual: suspense  Stem: suspens
Actual: and  Stem: and
Actual: Thriller  Stem: thriller
Actual: .  Stem: .
Actual: One  Stem: one
Actual: of  Stem: of
Actual: the  Stem: the
Actual: best  Stem: best
Actual: work  Stem: work
Actual: of  Stem: of
Actual: Dan  Stem: dan
Actual: Brown  Stem: brown

Lemmatization :

Lemmatization is slower as compared to stemming but it knows the context of the word before proceeding.

  • It is a dictionary-based approach.
  • Its accuracy is higher than that of stemming.
  • Lemmatization always returns a valid dictionary word as the root form.
  • Lemmatization is recommended when the meaning of the word is important for the analysis, for example in a question-answering application.
  • For example: “Studies” => “Study”
import nltk
# nltk.download('wordnet')  # run once if the WordNet data is not yet installed
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "Da Vinci Code is such an amazing book to read.The book is full of suspense and Thriller. One of the best work of Dan Brown."
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
Actual: Da  Lemma: Da
Actual: Vinci  Lemma: Vinci
Actual: Code  Lemma: Code
Actual: is  Lemma: is
Actual: such  Lemma: such
Actual: an  Lemma: an
Actual: amazing  Lemma: amazing
Actual: book  Lemma: book
Actual: to  Lemma: to
Actual: read.The  Lemma: read.The
Actual: book  Lemma: book
Actual: is  Lemma: is
Actual: full  Lemma: full
Actual: of  Lemma: of
Actual: suspense  Lemma: suspense
Actual: and  Lemma: and
Actual: Thriller  Lemma: Thriller
Actual: .  Lemma: .
Actual: One  Lemma: One
Actual: of  Lemma: of
Actual: the  Lemma: the
Actual: best  Lemma: best
Actual: work  Lemma: work
Actual: of  Lemma: of
Actual: Dan  Lemma: Dan
Actual: Brown  Lemma: Brown
Actual: .  Lemma: .

Some other, less common pre-processing steps are:

Text Normalization:

Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.
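A minimal sketch of dictionary-based normalization could look like the following; the lookup table and helper function are hypothetical, made up purely for illustration.

# Hypothetical lookup table mapping noisy spellings to their canonical forms
NORMALIZATION_MAP = {
    "gooood": "good",
    "gud": "good",
    "stop-words": "stopwords",
}

def normalize(tokens):
    """Replace each token with its canonical form if one is known."""
    return [NORMALIZATION_MAP.get(t.lower(), t) for t in tokens]

print(normalize(["This", "book", "is", "gooood"]))  # ['This', 'book', 'is', 'good']

In practice such mappings are built from the noise actually observed in the data (e.g. social media spellings), which is why normalization is so task-dependent.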

Language Detection:

What happens if our text is in a language other than English? Then the whole pipeline needs to be adapted to that language, so we need to detect the language before building the pipeline. Python provides various modules for language detection:

  • langdetect
  • textblob
  • langid
from langdetect import detect 
  
# Detecting the language of each sentence
print(detect("Geeksforgeeks is a computer science portal for geeks")) 
print(detect("Geeksforgeeks - это компьютерный портал для гиков")) 
print(detect("Geeksforgeeks es un portal informático para geeks")) 
print(detect("Geeksforgeeks是面向极客的计算机科学门户")) 
print(detect("Geeksforgeeks geeks के लिए एक कंप्यूटर विज्ञान पोर्टल है")) 
print(detect("Geeksforgeeksは、ギーク向けのコンピューターサイエンスポータルです。")) 
en
ru
es
no
hi
ja

Advanced Processing :

POS Tagging :

Imagine we’re asked to develop a system to identify person and organization names in our company’s collection of one million documents. The common pre-processing steps we discussed earlier may not be relevant in this context. Identifying names requires us to be able to do POS tagging, as identifying proper nouns can be useful in identifying person and organization names. Pre-trained and readily usable POS taggers are implemented in NLP libraries such as NLTK, spaCy and Parsey McParseface Tagger.

# nltk.download('averaged_perceptron_tagger')  # run once if the tagger model is not yet installed
tokens = nltk.word_tokenize("The quick brown fox jumps over a lazy dog")
print("Part of speech", nltk.pos_tag(tokens))
Part of speech [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('a', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Parse Tree:

Now, if we have to find the relationship between a person and an organization, we need to analyse the sentence in more depth, and this is where the parse tree plays a major role. A parse tree is a tree representation of the different syntactic categories of a sentence. It helps us understand the syntactic structure of a sentence.

from nltk import pos_tag, word_tokenize, RegexpParser

# Example text
sample_text = "The quick brown fox jumps over the lazy dog"

# Find all parts of speech in the above sentence
tagged = pos_tag(word_tokenize(sample_text))

# Define a chunk grammar that groups parts of speech into phrases
chunker = RegexpParser("""
                       NP: {<DT>?<JJ>*<NN>}    # To extract Noun Phrases
                       P: {<IN>}               # To extract Prepositions
                       V: {<V.*>}              # To extract Verbs
                       PP: {<P> <NP>}          # To extract Prepositional Phrases
                       VP: {<V> <NP|PP>*}      # To extract Verb Phrases
                       """)

# Print the chunked parse tree of the above sentence
output = chunker.parse(tagged)
print("After Extracting\n", output)
After Extracting
 (S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  (VP (V jumps/VBZ) (PP (P over/IN) (NP the/DT lazy/JJ dog/NN))))

Coreference Resolution :

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. For example, in the sentence "John said he would finish the report", coreference resolution links the pronoun "he" back to "John".

FEATURE ENGINEERING:

When we use ML methods to perform our modeling step later, we’ll still need a way to feed this pre-processed text into an ML algorithm. Feature engineering refers to the set of methods that will accomplish this task. It’s also referred to as feature extraction. The goal of feature engineering is to capture the characteristics of the text into a numeric vector that can be understood by the ML algorithms.

There are two different approaches taken in practice for feature engineering:

Classical NLP and traditional ML pipeline

Feature engineering is an integral step in any ML pipeline. Feature engineering steps convert the raw data into a format that can be consumed by a machine. These transformation functions are usually handcrafted in the classical ML pipeline, aligning to the task at hand. For example, imagine a task of sentiment classification on product reviews in e-commerce. One way to convert the reviews into meaningful “numbers” that helps predict the reviews’ sentiments (positive or negative) would be to count the number of positive and negative words in each review. There are statistical measures for understanding if a feature is useful for a task or not.

One of the advantages of handcrafted features is that the model remains interpretable—it’s possible to quantify exactly how much each feature is influencing the model prediction.
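A minimal sketch of such a handcrafted feature follows; the tiny positive/negative word lists below are hypothetical, not from the original post.

# Hypothetical, tiny sentiment lexicons used only for illustration
POSITIVE_WORDS = {"good", "great", "amazing", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "disappointing"}

def sentiment_counts(review):
    """Return a 2-element feature vector: [#positive words, #negative words]."""
    tokens = review.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return [pos, neg]

print(sentiment_counts("Amazing product, great battery but terrible packaging"))
# -> [2, 1]

Each review is thus reduced to a small numeric vector that a classical ML classifier can consume, and each dimension has an obvious interpretation.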

DL pipeline

In the DL pipeline, the raw data (after pre-processing) is directly fed to a model. The model is capable of "learning" features from the data. Hence, these features are more in line with the task at hand, so they generally give improved performance. But since all these features are learned via model parameters, the model loses interpretability.
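As a minimal sketch (assuming PyTorch; this is an illustration, not the post's implementation), a text classifier can take token ids directly and learn its own features in an embedding layer:

import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=64, num_classes=2):
        super().__init__()
        # The embedding weights are the "learned features"
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)          # average over the sequence
        return self.classifier(pooled)

model = TinyTextClassifier()
fake_batch = torch.randint(0, 10000, (4, 12))  # 4 dummy "documents", 12 tokens each
print(model(fake_batch).shape)                 # torch.Size([4, 2])

Here no feature is handcrafted: the embedding and classifier weights are both adjusted during training, which is exactly why the resulting features are hard to interpret.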

RECAP:

The first step in the process of developing any NLP system is to collect data relevant to the given task. Even if we're building a rule-based system, we still need some data to design and test our rules. The data we get is seldom (rarely) clean, and this is where text cleaning comes into play. After cleaning, text data often has a lot of variation and needs to be converted into a canonical (standard, pre-defined) form. This is done in the pre-processing step. This is followed by feature engineering, where we carve out the indicators that are most suitable for the task at hand. These indicators/features are converted into a format that is understandable by modeling algorithms.

Note: In future blog posts an in-depth implementation will be done, so don't worry if things seemed a little hazy :)