Think about how we would approach building an NLP application at an organization. We would normally walk through the requirements, break the problem down into several sub-problems, and then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step.

This step-by-step processing of text is known as an NLP pipeline. It is the series of steps involved in building any NLP model.

The key stages in the pipeline are as follows:

  1. Data acquisition

  2. Text cleaning

  3. Pre-processing

  4. Feature engineering

  5. Modeling

  6. Evaluation

  7. Deployment

  8. Monitoring and model updating

Before we dive into implementing NLP applications, the first and foremost thing is to get a clear picture of the pipeline. Below is a detailed overview of each component in it.

Note: This series on the NLP pipeline is divided into three blog posts. The first covers data acquisition and text cleaning, the second covers pre-processing and feature engineering, and the third covers modeling, evaluation, deployment, and monitoring and model updating.

DATA ACQUISITION

Data plays a major role in the NLP pipeline, so it is quite important how we collect the relevant data for our NLP project.

Sometimes the data is easily available to us; at other times, extra effort is needed to collect it.

1). Scrape web pages

Suppose we want to build an application that summarizes the top news in just 100 words. For that, we need to scrape the data from current-affairs websites and web pages.
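For instance, a rough scraping sketch with requests and Beautiful Soup might look like the following; the URL and the assumption that headlines sit in <h2> tags are placeholders for whichever news site you actually target.

import requests
from bs4 import BeautifulSoup

url = "https://example.com/news"                      # hypothetical news page
response = requests.get(url, timeout=10)              # fetch the raw HTML
soup = BeautifulSoup(response.text, "html.parser")

# collect headline text, assuming headlines live in <h2> tags on this page
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headlines[:10])                                 # candidate articles for the summarizer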

2). Data Augmentation

NLP has a bunch of techniques through which we can take a small dataset and use some tricks to create more data. These tricks are also called data augmentation, and they try to exploit language properties to create text that is syntactically similar to source text data. They may appear as hacks, but they work very well in practice. Let’s look at some of them:

a). Back translation

Say we have a sentence s1 in French. We translate it into another language (in this case English) to get sentence s2, and then translate s2 back into French to get s3. We'll find that s1 and s3 are very similar in meaning, with slight variations in wording. We can now add s3 to our dataset.
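As a concrete illustration, here is a minimal back-translation sketch using Hugging Face transformers with the Helsinki-NLP/opus-mt translation models (one possible choice, not the only one):

from transformers import pipeline

fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

s1 = "Je voudrais un café, s'il vous plaît."      # original French sentence
s2 = fr_to_en(s1)[0]["translation_text"]          # French -> English
s3 = en_to_fr(s2)[0]["translation_text"]          # English -> back to French
print(s3)                                         # paraphrase of s1 that we can add to the dataset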

b). Replacing Entities

To create more data, we replace entity names with other entities of the same type. Say s1 is "I want to go to New York"; here we replace New York with another entity name, e.g., New Jersey.
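A rough sketch of entity replacement using spaCy's named entity recognizer is shown below; the en_core_web_sm model and the replacement list are assumptions made for illustration.

import random
import spacy

nlp = spacy.load("en_core_web_sm")
replacements = ["New Jersey", "Boston", "Chicago"]        # hypothetical pool of place names

def replace_place_entities(sentence):
    doc = nlp(sentence)
    out = sentence
    for ent in doc.ents:
        if ent.label_ == "GPE":                           # geopolitical entities such as cities
            out = out.replace(ent.text, random.choice(replacements))
    return out

print(replace_place_entities("I want to go to New York"))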

c). Synonym Replacement

Randomly choose “k” words in a sentence that are not stop words. Replace these words with their synonyms.
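A simple sketch of this idea with NLTK's WordNet is given below (assuming the wordnet and stopwords corpora are downloaded; the first listed synonym may sometimes just be the original word).

import random
import nltk
from nltk.corpus import wordnet, stopwords

nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))

def synonym_replace(sentence, k=2):
    words = sentence.split()
    # only non-stop-words are candidates for replacement
    candidates = [w for w in words if w.lower() not in stop_words]
    for word in random.sample(candidates, min(k, len(candidates))):
        synsets = wordnet.synsets(word)
        if synsets:
            synonym = synsets[0].lemmas()[0].name().replace("_", " ")
            words = [synonym if w == word else w for w in words]
    return " ".join(words)

print(synonym_replace("The quick brown fox jumps over the lazy dog"))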

d). Bigram flipping

Divide the sentence into bigrams. Take one bigram at random and flip it. For example, in "I am going to the supermarket", we take the bigram "going to" and replace it with the flipped one: "to going."
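Because bigram flipping needs nothing beyond string handling, a toy sketch in plain Python is enough:

import random

def flip_random_bigram(sentence):
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.randrange(len(words) - 1)       # pick a random adjacent pair of words
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(flip_random_bigram("I am going to the supermarket"))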

TEXT CLEANING

After collecting data, it is also important that the data be in a form the computer can work with. Text often contains symbols and words that convey no meaning to the model during training, so we remove them efficiently before feeding the text to the model. This process is called text cleaning. The different text cleaning steps are as follows:

HTML tag cleaning

When collecting data we scrape various web pages, so the raw text usually comes wrapped in HTML markup. Libraries such as Beautiful Soup and Scrapy provide a range of utilities to parse web pages, so the text we end up with does not have any HTML tags in it.

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
try:
    page = urllib.request.urlopen(url)  # connect to the website and fetch the page
except Exception as e:
    print("An error occurred:", e)
    raise
soup = BeautifulSoup(page, 'html.parser')
# Wikipedia's table-of-contents entries are <li> elements whose class starts with "tocsection-"
regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})
content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])  # keep only the section title text
print(content)
['1 History', '2 Basics', '3 Challenges', '3.1 Reasoning, problem solving', '3.2 Knowledge representation', '3.3 Planning', '3.4 Learning', '3.5 Natural language processing', '3.6 Perception', '3.7 Motion and manipulation', '3.8 Social intelligence', '3.9 General intelligence', '4 Approaches', '4.1 Cybernetics and brain simulation', '4.2 Symbolic', '4.2.1 Cognitive simulation', '4.2.2 Logic-based', '4.2.3 Anti-logic or scruffy', '4.2.4 Knowledge-based', '4.3 Sub-symbolic', '4.3.1 Embodied intelligence', '4.3.2 Computational intelligence and soft computing', '4.4 Statistical', '4.5 Integrating the approaches', '5 Tools', '6 Applications', '7 Philosophy and ethics', '7.1 The limits of artificial general intelligence', '7.2 Ethical machines', '7.2.1 Artificial moral agents', '7.2.2 Machine ethics', '7.2.3 Malevolent and friendly AI', '7.3 Machine consciousness, sentience and mind', '7.3.1 Consciousness', '7.3.2 Computationalism and functionalism', '7.3.3 Strong AI hypothesis', '7.3.4 Robot rights', '7.4 Superintelligence', '7.4.1 Technological singularity', '7.4.2 Transhumanism', '8 Impact', '8.1 Risks of narrow AI', '8.2 Risks of general AI', '9 Regulation', '10 In fiction', '11 See also', '12 Explanatory notes', '13 References', '13.1 AI textbooks', '13.2 History of AI', '13.3 Other sources', '14 Further reading', '15 External links']

Unicode Normalization

While cleaning the data we may also encounter various Unicode characters, including symbols, emojis, and other graphic characters. To handle such non-textual symbols and special characters, we use Unicode normalization: the text we see is converted into some binary representation the computer can store. This process is known as text encoding.

import emoji

text = emoji.emojize("Python is fun :red_heart:")
print(text)
Python is fun ❤

encoded = text.encode("utf-8")   # convert the string into its UTF-8 byte representation
print(encoded)
b'Python is fun \xe2\x9d\xa4'
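Beyond emojis, the standard library's unicodedata module can normalize text explicitly; the snippet below is a small sketch using NFKC normalization, which folds compatibility characters such as ligatures and fraction signs into more conventional forms.

import unicodedata

raw = "ﬁnance costs ½ the budget"              # contains the "fi" ligature and a fraction character
normalized = unicodedata.normalize("NFKC", raw)
print(normalized)                              # the ligature becomes "fi" and ½ becomes "1⁄2"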

Spelling Correction

The data we have might contain spelling mistakes caused by fast typing, or by the shorthand and slang used on social media platforms such as Twitter. Feeding such noisy text to the model can hurt its predictions, so it is important to handle it beforehand. We don't have a fully robust method to fix spelling, but we can still make good attempts to mitigate the issue. Microsoft, for example, released a REST API that can be used from Python for spell checking.
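The Microsoft API needs an access key, so as a lightweight stand-in here is a sketch using the open-source pyspellchecker package (a substitute chosen for illustration, not the API mentioned above):

from spellchecker import SpellChecker          # pip install pyspellchecker

spell = SpellChecker()
words = "speling mistakes happen when typin fast".split()
# correction() returns the most likely fix, or None if it has no suggestion
corrected = [spell.correction(w) or w for w in words]
print(" ".join(corrected))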

System-Specific Error Correction

  • What if we need to extract data from a PDF? Different PDF documents are encoded differently, and sometimes we may not be able to extract the full text, or the structure of the text may get messed up. There are several libraries, such as PyPDF and PDFMiner, to extract text from PDF documents, but they are far from perfect.

  • Another common source of textual data is scanned documents. Text extraction from scanned documents is typically done through optical character recognition (OCR), using libraries such as Tesseract; a rough sketch of both cases follows this list.
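A minimal sketch, assuming a local "report.pdf" and a scanned image "scan.png" exist, and swapping in pypdf and pytesseract as concrete examples of the kinds of libraries mentioned above:

from pypdf import PdfReader                    # PDF text extraction
from PIL import Image
import pytesseract                             # Python wrapper around the Tesseract OCR engine

reader = PdfReader("report.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(pdf_text[:500])                          # extracted text is often imperfect

ocr_text = pytesseract.image_to_string(Image.open("scan.png"))
print(ocr_text[:500])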

Recap

The first step in developing any NLP system is to collect data relevant to the given task. Even if we're building a rule-based system, we still need some data to design and test our rules. The data we get is seldom clean, and this is where text cleaning comes into play.

If you face any problems or have any feedback or suggestions, feel free to comment.