Imagine trying to understand a conversation in a noisy room. The background noise, accents, and rapid speech can make it difficult to comprehend the meaning. Similarly, raw text data is often noisy and unstructured, making it challenging for machines to extract meaningful insights. Preprocessing acts as a noise-cancellation technique, cleaning and organizing the data, making it easier for models to comprehend.
Text preprocessing encompasses all the preparatory steps applied to raw text data before it can be analyzed or processed by NLP models. It is a critical phase in Natural Language Processing, as the quality of preprocessing directly impacts the performance of downstream tasks such as sentiment analysis, classification, or machine translation.
"In text analysis, far more time is spent in preparing and preprocessing the text data than in the analysis itself."
- Dumais et al., 1998
In this article, we will explore the key techniques used in text preprocessing, including tokenization, stopword removal, stemming, lemmatization, and more. Using a real-world example, we will demonstrate how these methods transform raw textual data into a structured format suitable for analysis. Let’s dive into the details!
To fully understand the significance of preprocessing, let’s explore why this step is indispensable in NLP workflows.
Why is Preprocessing Crucial in NLP?
Text preprocessing is crucial in NLP because raw text is often noisy and unstructured, which makes it difficult for models to extract meaningful patterns. Preprocessing cleans and standardizes text, improving model accuracy and efficiency.
Noise Reduction
Removes irrelevant elements like punctuation, URLs, or special characters, simplifying text for analysis.
Raw text: “Check out this blog: ibraheemk.hashnode.dev”
Cleaned text: “Check out this blog”
Dimensionality Reduction
Techniques like stopword removal and stemming reduce the size of text data, making computations faster.
Raw text: “Cats are running around”
Processed text: “cat run around”
Standardization
Converts diverse inputs to a common format (e.g., case normalization, lemmatization).
Raw text: "Analyze" vs. "Analyzing"
Lemmatized: “analyze“
Context Preservation
Preprocessing like tokenization retains linguistic structure for better understanding.
Raw text: “I can cook“
Tokenized: ["I", "can", "cook"]
Together, these preprocessing techniques ensure that NLP models operate on clean, efficient, and meaningful data.
Now that we've outlined the importance of preprocessing, let’s apply these principles to a real-world example: preprocessing British Airways airline reviews.
The Task
The goal is to preprocess a dataset of airline reviews for British Airways. The reviews were scraped from the website AirlineQuality using the BeautifulSoup library. Below is the code used for web scraping and data collection.
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 30
page_size = 100
reviews = []
dates = []
# for i in range(1, pages + 1):
for i in range(1, pages + 1):
print(f"Scraping page {i}")
# Create URL to collect links from paginated data
url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
# Collect HTML data from this page
response = requests.get(url)
# Parse content
content = response.content
parsed_content = BeautifulSoup(content, 'html.parser')
for para in parsed_content.find_all("div", {"class": "text_content"}):
reviews.append(para.get_text())
for date in parsed_content.find_all("time", {"itemprop": "datePublished"}):
dates.append(date.get_text())
print(f" ---> {len(reviews)} total reviews")
After collecting the data, we’ll convert it into a dataframe with two columns: dates (when the review was published) and reviews (the actual text of the review).
import pandas as pd

raw_df = pd.DataFrame()
raw_df["dates"] = dates
raw_df["reviews"] = reviews
Let’s take a look at two sample reviews.
Sample 1
"✅ Trip Verified | We have sat on this plane for an hour and forty five minutes awaiting takeoff due to bad weather in London. This is understandable for safety. Fortunately I have a long layover so this delay does not affect me. However many others are not so lucky. While waiting we were given one small bottle of water and one tiny pack of corn kernels. Later food and drink were offered for purchase. This is my complaint. After sitting idle on this plane for nearly two hours and possibly missing connecting flights they can’t give us a free bag of chips and a coke? When I asked I was told no. Is British Airways really this cheap and money grubbing?"
Sample 2
"Not Verified | British Airways stranding my wife and I at Heathrow Airport for 2.5 days, with no access to our baggage. We we told by airline employees to purchase any necessities (toiletries, refreshments, etc.) and we'd be reimbursed. That claim was denied, with BA claiming that weather was the reason for the delay (an outright lie since every other airline in London was flying on-time). We were unable to to anything besides stand in lines for vouchers, stand in lines for shuttles, stand in line for reservations, etc. for 48+ hours. It was an absolute nightmare and I will never fly BA again."
From these examples, we can observe some characteristics of the raw text:
Special characters: Emojis like "✅".
Verification status: Phrases like "Trip Verified" or "Not Verified" can be extracted into a separate column for analysis.
To address these preprocessing challenges effectively, we’ll leverage the re library, a powerful tool for text manipulation in Python.
The re library
We will use the re library to remove emojis and special characters, and to create a new column for verification status.
The re library in Python is used for Regular Expression operations. It’s essential for text preprocessing and manipulation tasks. It allows pattern matching, searching, splitting, and replacing text based on specific rules. Common functions include:
re.search(): Searches for a pattern in a string.
re.match(): Checks if a pattern matches at the beginning of a string.
re.findall(): Finds all matches of a pattern in a string.
re.sub(): Replaces occurrences of a pattern with a specified string.
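Here is a quick sketch of these four functions on a made-up string (both the string and the pattern are purely illustrative):

import re

sample_text = "Flight BA117 was delayed. Flight BA249 was on time."

print(re.search(r"BA\d+", sample_text))         # first match object (for 'BA117')
print(re.match(r"Flight", sample_text))         # matches, since the string starts with 'Flight'
print(re.findall(r"BA\d+", sample_text))        # ['BA117', 'BA249']
print(re.sub(r"BA\d+", "BA-xxx", sample_text))  # 'Flight BA-xxx was delayed. Flight BA-xxx was on time.'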
In the following code snippet, we’ll use the re library to clean and preprocess the text data by removing the emoji. Additionally, we’ll extract and categorize the verification status (e.g., "Trip Verified," "Not Verified," or "Unknown") into a separate column for better organization.
import re

new_list = []
for i in range(len(raw_df)):
    if "Trip Verified" in raw_df['reviews'][i] or "Not Verified" in raw_df['reviews'][i]:
        # Split on "|" and strip the emoji from each part
        new_list.append([re.sub(r'✅ ', '', part.strip()) for part in raw_df['reviews'][i].split("|")])
    else:
        new_list.append(['Unknown', raw_df['reviews'][i]])

df = pd.DataFrame(dates, columns=["dates"]).join(pd.DataFrame(new_list, columns=["verified", "review"]))
df.head()
The re.sub(r'✅ ', '', part.strip()) call replaces each occurrence of the emoji with an empty string, effectively removing it.
Looking at the review again, we can see that the emoji has been removed.
We have sat on this plane for an hour and forty five minutes awaiting takeoff due to bad weather in London. This is understandable for safety. Fortunately I have a long layover so this delay does not affect me. However many others are not so lucky. While waiting we were given one small bottle of water and one tiny pack of corn kernels. Later food and drink were offered for purchase. This is my complaint. After sitting idle on this plane for nearly two hours and possibly missing connecting flights they can’t give us a free bag of chips and a coke? When I asked I was told no. Is British Airways really this cheap and money grubbing?
While the re library is excellent for cleaning text, more advanced preprocessing tasks require a more comprehensive library like NLTK.
The nltk library
The Natural Language Toolkit (nltk) is one of the most widely used Python libraries for Natural Language Processing. It provides an extensive suite of tools and functionalities for working with text data, making it invaluable for preprocessing tasks in NLP projects.
Why Use nltk for Preprocessing?
Comprehensive Tools: It handles almost every preprocessing task, from cleaning and tokenization to stemming and lemmatization.
Ease of Use: NLTK is beginner-friendly, with intuitive methods and well-documented tutorials for each task.
Flexibility: It integrates seamlessly with other Python libraries such as NumPy and pandas, making it a powerful choice for preprocessing pipelines.
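Before running the nltk examples in this article, the required corpora and models need to be downloaded once. The resource names below are the standard ones, but they can vary slightly between NLTK versions, so treat this as a starting point rather than a definitive list:

import nltk

# One-time downloads (names may differ slightly across NLTK versions)
nltk.download('punkt')                       # tokenizer models for word_tokenize / sent_tokenize
nltk.download('stopwords')                   # predefined stopword lists
nltk.download('wordnet')                     # lexical database used by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # tagger model used by nltk.pos_tag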
Steps for the Preprocessing Task
Cleaning: Removing punctuation/special characters
Standardization
Tokenization
Removing stopwords
Stemming and Lemmatization
We’ll explain each of these tasks using examples from the dataset. Afterward, we’ll write a function to apply these transformations across the entire dataset.
Text Cleaning
Text cleaning is the foundational step in preprocessing, where unnecessary elements like punctuation, special characters, and extra whitespaces are removed to ensure the text is in a clean, analyzable format. This step is crucial because raw text often contains noise that can confuse or hinder NLP models. We have already seen an example when we wanted to create a new column showing the status of the trip, verified or not verified. We used the re library to remove an emoji.
Punctuation (e.g., !, ?, #) and special characters (e.g., emojis or symbols like @ and $) add no significant meaning in most NLP tasks, such as sentiment analysis or topic modeling. They can distort tokenization by breaking words improperly or being treated as standalone tokens.
Also, whitespace characters (e.g., multiple spaces, tabs, or newline characters) represent blank space and are used to make text easier to read, but stray whitespace can interfere with tokenization, creating blank tokens or mismatched word indices during analysis.
Removing noise also reduces the dimensionality of the dataset, speeding up computations and improving model focus on meaningful features.
We can use the re library to remove punctuation and special characters, as shown in the code below:
# 'text' holds the review shown above (after removing the emoji and verification tag)
text_cleaned = re.sub(r'[^\w\s]', '', text)
print(text_cleaned)
re.sub(pattern, replacement, string): Replaces all matches of the pattern with the specified replacement.
r'[^\w\s]': A regex pattern that matches any character that is not a word character (\w, which includes letters, numbers, and underscores) or whitespace (\s).
The result is shown below:
We have sat on this plane for an hour and forty five minutes awaiting takeoff due to bad weather in London This is understandable for safety Fortunately I have a long layover so this delay does not affect me However many others are not so lucky While waiting we were given one small bottle of water and one tiny pack of corn kernels Later food and drink were offered for purchase This is my complaint After sitting idle on this plane for nearly two hours and possibly missing connecting flights they cant give us a free bag of chips and a coke When I asked I was told no Is British Airways really this cheap and money grubbing
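As a quick sanity check, the same pattern can be applied to a short made-up string (the string below is purely illustrative):

sample = "Great flight!!! Food was... meh? #BA123 @heathrow"
print(re.sub(r'[^\w\s]', '', sample))
# Great flight Food was meh BA123 heathrow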
Once the text is cleaned of noise and extraneous elements, the next step is to ensure consistency through standardization.
Standardization
Text standardization is the process of converting text into a uniform format to reduce variability and ensure consistency. It eliminates discrepancies caused by differences in formatting, casing, or representation, which can confuse NLP models. For example, NLP models will treat "British" and "british" as distinct tokens unless standardized. Lowercasing ensures they are treated as the same entity. The number of unique tokens in the dataset is also reduced, which improves efficiency and minimizes memory usage.
In our task, we will focus on lowercasing. Lowercasing converts all letters in the text to lowercase, which ensures uniformity regardless of the original casing. We can use Python’s lower() method to convert all the letters in a string to lowercase.
text_lower = text_cleaned.lower()
text_lower
The output will be in lowercase.
we have sat on this plane for an hour and forty five minutes awaiting takeoff due to bad weather in london this is understandable for safety fortunately i have a long layover so this delay does not affect me however many others are not so lucky while waiting we were given one small bottle of water and one tiny pack of corn kernels later food and drink were offered for purchase this is my complaint after sitting idle on this plane for nearly two hours and possibly missing connecting flights they cant give us a free bag of chips and a coke when i asked i was told no is british airways really this cheap and money grubbing
With standardized text, the focus shifts to breaking it down into meaningful units, a process known as tokenization.
Tokenization
Tokenization is the process of splitting text into smaller units called tokens (e.g., words, phrases, or sentences). These tokens serve as the foundation for most NLP tasks. Tokens can be words, subwords, or sentences, depending on the type of tokenization. There are three main types of tokenization:
Word tokenization: This splits a text into individual words. This is useful for tasks like word frequency analysis, stemming, or lemmatization. Here’s how to do this with nltk.
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text_lower)
print(tokens)
Running this code will give the output below:
['we', 'have', 'sat', 'on', 'this', 'plane', 'for', 'an', 'hour', 'and', 'forty', 'five', 'minutes', 'awaiting', 'takeoff', 'due', 'to', 'bad', 'weather', 'in', 'london', 'this', 'is', 'understandable', 'for', 'safety', 'fortunately', 'i', 'have', 'a', 'long', 'layover', 'so', 'this', 'delay', 'does', 'not', 'affect', 'me', 'however', 'many', 'others', 'are', 'not', 'so', 'lucky', 'while', 'waiting', 'we', 'were', 'given', 'one', 'small', 'bottle', 'of', 'water', 'and', 'one', 'tiny', 'pack', 'of', 'corn', 'kernels', 'later', 'food', 'and', 'drink', 'were', 'offered', 'for', 'purchase', 'this', 'is', 'my', 'complaint', 'after', 'sitting', 'idle', 'on', 'this', 'plane', 'for', 'nearly', 'two', 'hours', 'and', 'possibly', 'missing', 'connecting', 'flights', 'they', 'cant', 'give', 'us', 'a', 'free', 'bag', 'of', 'chips', 'and', 'a', 'coke', 'when', 'i', 'asked', 'i', 'was', 'told', 'no', 'is', 'british', 'airways', 'really', 'this', 'cheap', 'and', 'money', 'grubbing']
Sentence tokenization: This splits text into sentences. This is particularly helpful when analyzing documents or texts where sentence structure matters, such as summarization or sentiment analysis. In nltk, we import the sent_tokenize function like this: from nltk.tokenize import sent_tokenize. A short example appears after this list.
Subword tokenization: This breaks text into smaller units, such as prefixes, suffixes, or syllables. It is widely used in modern NLP models like BERT or GPT to handle unknown words. For instance, the word “unbelievable” might be split as: ['un', 'believ', 'able']. This type is not available directly in nltk, but libraries like SentencePiece or Byte Pair Encoding (BPE) implementations handle it.
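As referenced above, here is a small sketch of sentence tokenization, applied to a shortened two-sentence snippet from the first sample review. Note that sent_tokenize relies on punctuation, so sentence splitting should happen before punctuation is stripped.

from nltk.tokenize import sent_tokenize

snippet = ("We have sat on this plane for an hour and forty five minutes "
           "awaiting takeoff due to bad weather in London. This is understandable for safety.")
print(sent_tokenize(snippet))
# ['We have sat on this plane for an hour and forty five minutes awaiting takeoff due to bad weather in London.',
#  'This is understandable for safety.']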
Part-of-Speech (POS) tagging: After tokenization, the next step in some NLP workflows is Part-of-Speech tagging. This process assigns grammatical categories (such as noun, verb, adjective, etc.) to each token in a text. POS tagging provides valuable insights into the structure and meaning of a sentence and is often used in syntactic parsing, sentiment analysis, or even feature extraction for machine learning models. For example, in feature extraction, tags like nouns and verbs often carry more information in tasks like information retrieval or text classification.
To apply POS tagging to the tokens, we can use the pos_tag function in nltk.
print(nltk.pos_tag(tokens))
Running this produces a list of (word, tag) tuples; the excerpt below shows the tags for the review's content words:
[('sat', 'JJ'), ('plane', 'NN'), ('hour', 'NN'), ('forty', 'NN'), ('five', 'CD'), ('minutes', 'NNS'), ('awaiting', 'VBG'), ('takeoff', 'NN'), ('due', 'JJ'), ('bad', 'JJ'), ('weather', 'NN'), ('london', 'NN'), ('understandable', 'JJ'), ('safety', 'NN'), ('fortunately', 'RB'), ('long', 'RB'), ('layover', 'JJ'), ('delay', 'NN'), ('affect', 'VBP'), ('however', 'RB'), ('many', 'JJ'), ('others', 'NNS'), ('lucky', 'JJ'), ('waiting', 'VBG'), ('given', 'VBN'), ('one', 'CD'), ('small', 'JJ'), ('bottle', 'NN'), ('water', 'NN'), ('one', 'CD'), ('tiny', 'JJ'), ('pack', 'NN'), ('corn', 'NN'), ('kernels', 'NNS'), ('later', 'RB'), ('food', 'NN'), ('drink', 'NN'), ('offered', 'VBD'), ('purchase', 'NN'), ('complaint', 'NN'), ('sitting', 'VBG'), ('idle', 'JJ'), ('plane', 'NN'), ('nearly', 'RB'), ('two', 'CD'), ('hours', 'NNS'), ('possibly', 'RB'), ('missing', 'VBG'), ('connecting', 'VBG'), ('flights', 'NNS'), ('cant', 'JJ'), ('give', 'VBP'), ('us', 'PRP'), ('free', 'JJ'), ('bag', 'NN'), ('chips', 'NNS'), ('coke', 'VBD'), ('asked', 'VBN'), ('told', 'JJ'), ('british', 'JJ'), ('airways', 'NNS'), ('really', 'RB'), ('cheap', 'JJ'), ('money', 'NN'), ('grubbing', 'VBG')]
The output above is a list of tuples, where each tuple consists of a word and its corresponding part-of-speech (POS) tag. For example:

('sat', 'JJ'): The word "sat" is tagged as an adjective (JJ), although it could also be used as a verb depending on context.
('plane', 'NN'): The word "plane" is identified as a noun (NN).
('awaiting', 'VBG'): The word "awaiting" is recognized as a present participle or gerund (VBG).
('five', 'CD'): The word "five" is tagged as a cardinal number (CD).
('given', 'VBN'): The word "given" is tagged as a past participle verb (VBN).
Removal of Stopwords
Stopwords are commonly used words in a language, such as “and,” “the,” “is,” or “in.” While they are essential for grammatical structure, they usually carry little to no meaningful information for tasks like sentiment analysis or topic modeling. Stopwords add redundancy without contributing much value. For example, in the sentence “British Airways is a great airline,” words like “is” and “a” are less important than “British Airways” and “great.” As you might have guessed, stopword removal also reduces the size of the dataset, making computations faster.
Stopwords can be removed in two basic ways:
Predefined Lists: NLTK provides predefined lists of stopwords for many languages. For example, the English stopword list in NLTK includes around 179 words.
NLTK has stopwords for various languages. To use the English language stopwords, we import stopwords and then we can store them in a variable.
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(english_stopwords)
The English stopwords are:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
To remove the stopwords from the text, we can write a function to check if they exist in the text and then remove them.
def remove_stopwords(text):
    output = [i for i in text if i not in english_stopwords]
    return output
This function can then be applied on the tokens:
tokens_no_stopwords = remove_stopwords(tokens)
print(tokens_no_stopwords)
We will get an output without the stopwords, as seen below:
['sat', 'plane', 'hour', 'forty', 'five', 'minutes', 'awaiting', 'takeoff', 'due', 'bad', 'weather', 'london', 'understandable', 'safety', 'fortunately', 'long', 'layover', 'delay', 'affect', 'however', 'many', 'others', 'lucky', 'waiting', 'given', 'one', 'small', 'bottle', 'water', 'one', 'tiny', 'pack', 'corn', 'kernels', 'later', 'food', 'drink', 'offered', 'purchase', 'complaint', 'sitting', 'idle', 'plane', 'nearly', 'two', 'hours', 'possibly', 'missing', 'connecting', 'flights', 'cant', 'give', 'us', 'free', 'bag', 'chips', 'coke', 'asked', 'told', 'british', 'airways', 'really', 'cheap', 'money', 'grubbing']
Custom Lists: Depending on the context, you might need to add or remove words from the predefined list. This can be done using Python’s list methods. For instance, in our British Airways review dataset, words like “flight” or “airline” might appear frequently but add little value and could be added to the stopword list.
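A minimal sketch of customizing the list, assuming we want to drop domain-heavy words like "flight" and "airline" while keeping the negations "not" and "no" (the exact word choices here are illustrative, not prescriptive):

custom_stopwords = set(english_stopwords)

# Add domain-specific words that dominate airline reviews but add little signal
custom_stopwords.update(["flight", "airline"])

# Keep negations, which can matter for sentiment (e.g. "not great" vs. "great")
custom_stopwords.difference_update(["not", "no"])

tokens_custom = [t for t in tokens if t not in custom_stopwords]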
Should stopwords always be removed?
Ultimately, this will depend on your use-case. For example, in sentiment analysis, words like “not” or “no” are stopwords but are critical for negation (e.g., “not great” vs. “great”). Also, for tasks like machine translation or summarization, removing stopwords may lead to loss of grammatical integrity.
Stemming and Lemmatization
Before delving into the final preprocessing steps, it’s important to understand two foundational concepts: syntax and semantics. Syntax refers to the structure of sentences, encompassing grammar and parts of speech (e.g., nouns, verbs, and adjectives) that define the roles words play within a sentence. In contrast, semantics deals with the meaning conveyed by words and phrases.
Stemming and lemmatization are closely tied to these concepts, as they aim to simplify words while retaining their syntactic and semantic roles, ensuring the text is both compact and meaningful for analysis. These techniques are used to reduce words to their base or root forms. They are essential for handling linguistic variations in text, ensuring that words like “run,” “running,” and “ran” are treated as a single concept.
Stemming
This is a rule-based process that removes suffixes and prefixes from words to obtain the root form. However, this root form (known as a "stem") may not always be a valid word.
NLTK provides several popular stemming algorithms, each with unique characteristics. The Porter Stemmer is one of the oldest and most widely used stemmers. It employs a set of heuristic rules to strip suffixes and reduce words to their root form, making it efficient and reliable for general use. For instance, "running" would be reduced to "run." The Lancaster Stemmer, on the other hand, is more aggressive, often producing shorter stems. While it is faster, it can sometimes over-stem words, leading to a loss of meaning; for example, "maximum" might be reduced to "maxim."
In NLTK, we can use the Porter Stemmer as shown below.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
tokens_stemmed = [stemmer.stem(word) for word in tokens_no_stopwords]
print(tokens_stemmed)
This will give the output shown below:
['sat', 'plane', 'hour', 'forti', 'five', 'minut', 'await', 'takeoff', 'due', 'bad', 'weather', 'london', 'understand', 'safeti', 'fortun', 'long', 'layov', 'delay', 'affect', 'howev', 'mani', 'other', 'lucki', 'wait', 'given', 'one', 'small', 'bottl', 'water', 'one', 'tini', 'pack', 'corn', 'kernel', 'later', 'food', 'drink', 'offer', 'purchas', 'complaint', 'sit', 'idl', 'plane', 'nearli', 'two', 'hour', 'possibl', 'miss', 'connect', 'flight', 'cant', 'give', 'us', 'free', 'bag', 'chip', 'coke', 'ask', 'told', 'british', 'airway', 'realli', 'cheap', 'money', 'grub']
We can see that words like “minutes” have been stemmed to “minut” and “fortunately” to “fortun”.
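To make the earlier point about aggressiveness concrete, here is a small side-by-side sketch of the Porter and Lancaster stemmers (the word list is arbitrary):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster stems are typically shorter and more aggressive,
# e.g. "maximum" becomes "maxim", as mentioned earlier.
for word in ["running", "maximum", "fortunately", "connection"]:
    print(f"{word}: porter={porter.stem(word)}, lancaster={lancaster.stem(word)}")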
Stemming is a simple and fast technique, and it is very useful for applications where meaning is less critical. However, the major drawback, as shown in the examples above, is that it often produces non-meaningful stems and strips word variations aggressively, which leads to a loss of linguistic meaning. Thankfully, lemmatization is a technique that can solve this issue.
Lemmatization
Lemmatization uses vocabulary and grammar rules to reduce words to their base form (lemma), ensuring that the result is a valid word. It considers the context, part of speech (POS), and root meaning of the word. Lemmatization solves the problems with stemming by using a dictionary to ensure outputs are valid words. Lemmatization also considers the word’s role in the sentence.
Using NLTK’s WordNetLemmatizer, we can lemmatize text as follows:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokens_lemmatized = [lemmatizer.lemmatize(word) for word in tokens_no_stopwords]
print(tokens_lemmatized)
This will print out the list as shown below.
['sat', 'plane', 'hour', 'forty', 'five', 'minute', 'awaiting', 'takeoff', 'due', 'bad', 'weather', 'london', 'understandable', 'safety', 'fortunately', 'long', 'layover', 'delay', 'affect', 'however', 'many', 'others', 'lucky', 'waiting', 'given', 'one', 'small', 'bottle', 'water', 'one', 'tiny', 'pack', 'corn', 'kernel', 'later', 'food', 'drink', 'offered', 'purchase', 'complaint', 'sitting', 'idle', 'plane', 'nearly', 'two', 'hour', 'possibly', 'missing', 'connecting', 'flight', 'cant', 'give', 'u', 'free', 'bag', 'chip', 'coke', 'asked', 'told', 'british', 'airway', 'really', 'cheap', 'money', 'grubbing']
Comparing this result with the output of the stemming operation, we can clearly see that the lemmas make more sense linguistically.
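One caveat worth noting: WordNetLemmatizer treats every word as a noun unless a part of speech is supplied, which is why forms like "sitting" and "offered" survive unchanged above. Below is a minimal sketch of passing the POS explicitly (hard-coded to verbs here for illustration; in practice the tags from nltk.pos_tag can be mapped to WordNet's POS categories):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Default POS is noun, so verb forms pass through unchanged
print(lemmatizer.lemmatize("sitting"))           # sitting
print(lemmatizer.lemmatize("sitting", pos="v"))  # sit
print(lemmatizer.lemmatize("offered", pos="v"))  # offer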
By reducing words to their roots or base forms, stemming and lemmatization ensure that linguistic variations do not compromise the accuracy or effectiveness of analysis.
Combining Preprocessing Steps
After applying all preprocessing techniques (cleaning, standardization, tokenization, stopword removal, stemming, or lemmatization), the text becomes compact and suitable for further analysis. We will incorporate everything into a Python function for streamlining and reusability. This function will take the dataframe as input and return the cleaned and preprocessed text, ready for tasks such as classification, clustering, or sentiment analysis.
def process_reviews(dataframe):
    lemmatizer = WordNetLemmatizer()
    processed_reviews = []
    for review in dataframe['review']:
        # Cleaning: remove punctuation and special characters
        review = re.sub(r'[^\w\s]', '', review)
        # Standardization: lowercase everything
        review = review.lower()
        # Tokenization
        review = word_tokenize(review)
        # Stopword removal
        review = remove_stopwords(review)
        # Lemmatization
        review = [lemmatizer.lemmatize(word) for word in review]
        processed_reviews.append(review)
    return processed_reviews
df['processed_reviews'] = process_reviews(df)
This pipeline ensures that the text is clean, uniform, and ready for analysis, such as sentiment classification or topic modeling.
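Depending on the downstream tooling, one extra step may be useful: many vectorizers (for example, scikit-learn's CountVectorizer or TfidfVectorizer) expect plain strings rather than token lists. A small sketch, assuming an illustrative processed_text column name:

# Rejoin the token lists into space-separated strings for string-based vectorizers
df['processed_text'] = df['processed_reviews'].apply(lambda tokens: " ".join(tokens))
df[['processed_reviews', 'processed_text']].head()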
Conclusion
In NLP, text preprocessing is not a one-size-fits-all solution and must be carefully tailored to the specific task. While removing punctuation or stopwords can simplify text and improve model efficiency, it can also strip away important context. For instance, punctuation often carries meaning, and stopwords may provide critical cues in sentiment or sarcasm detection. Ignoring word order can further confuse models, especially in tasks like machine translation.
Challenges like sarcasm, irony, or subtle differences in text highlight the limitations of traditional preprocessing. Statements like "Oh great, another rainy day" might lose their sarcastic tone if preprocessing removes punctuation or overly simplifies the input.
The key takeaway is that preprocessing should align with the specific needs of the task. For example, while minimal preprocessing may suit advanced models like BERT (Bidirectional Encoder Representations from Transformers), which capture context well, simpler models may require more structured input. Regular error analysis can help identify when preprocessing improves or hinders performance, guiding adjustments to create more effective pipelines.
Thank you for taking the time to read this article. I hope you found it insightful! If you enjoyed it, please consider liking, sharing, or commenting with your thoughts and questions. Your engagement not only supports this work but also helps others discover valuable content.
For further reading, check out:
[Article] Text Preprocessing Techniques for beginners by Crypt(iq)
[Article] Text Preprocessing in NLP with Python Codes by AnalyticsVidhya
[Book] Practical Text Analytics by Murugan Anandarajan, Chelsey Hill & Thomas Nolan
[Book] Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin
[Docs] NLTK documentation
[Docs] re documentation