Natural Language Processing: A Primer

I’ve been fascinated by the possibility of extracting knowledge from large bodies of text using computational methods since… well, since I started reading scientific literature. Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. This note gives an overview of basic NLP concepts and methods, with practical examples in Python.

What do we need?

The two main Python libraries we will need are the Natural Language Toolkit (NLTK) and Scikit-learn (sklearn).

NLTK is a powerful library designed to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Scikit-learn is a huge general-purpose collection of powerful and user-friendly machine learning tools. It provides simple and efficient interfaces for data analysis and modeling. It includes a wide range of algorithms for classification, regression, clustering, and dimensionality reduction, along with utilities for data preprocessing, model selection, and evaluation. And, of course, it has a few useful tools for NLP.

Setup

Installation:

pip install nltk
pip install -U scikit-learn

The first thing you will probably want to do is to download specific data sets and corpora used by NLTK in processing. You can do this using the interactive downloader:

import nltk
nltk.download()

This will open a GUI from which you can choose what to download. It’s common to start with the packages from the “popular” section, and expand them as needed.

Data Preprocessing

Like every data analysis project, NLP tasks typically begin with data cleaning, preparation, and preprocessing. Common steps include:

Lowercasing, to remove variability, unless case carries significant semantic meaning (e.g., “amigo” the term of endearment vs. “AMIGO” the gene name).
Tokenization, the breaking up of the text into smaller units, usually words, subwords, or morphemes. In some cases, tokenization can be done on larger units, like sentences and paragraphs.
Stopword filtering, the removal of frequent words (e.g., articles, prepositions) or punctuation that are deemed insignificant. In some cases, instead of complete removal, some of the most frequent tokens are subsampled to decrease their frequency but still retain some examples.

Depending on the application, preprocessing sometimes also includes:

Stemming, the reduction of words to their base or root form, often by stripping common suffixes, e.g., “computational” –> “comput”.
Lemmatization, reduction of words to their canonical forms (lemma) by using vocabulary and morphological analysis to remove inflections, e.g., “mice” –> “mouse”, or “better” –> “good”.

Now let’s get our hands ~~dirty~~ slovenly! As I am a big fan of recursion, in our practical examples I will be using this very note as a specimen of a document for processing.

Let’s start with the first three steps:

import re  # needed for re.sub() in preprocess()
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [word for word in tokens if len(word)>2]
    return tokens

# read() keeps the file's own newlines; joining readlines() with "\n" would double them
with open("nlp-primer.md", "r", encoding="utf8") as f:
    document = f.read()

tokens = preprocess(document)

The function above takes any text as input and starts off by converting it all to lowercase. Next, it uses a regex to replace non-alphabetic characters with a space, effectively removing numbers, punctuation, and any of the annoying markup. After cleaning the text, it uses NLTK’s word tokenizer to break up the string into individual word tokens. Following tokenization, it initializes a set of English stop words from the NLTK corpora, and filters them out from the token list using a standard list comprehension. Lastly, it removes any leftover junk by dropping tokens shorter than three characters.

Note that when dealing with languages other than English, you may want to use Python’s .casefold() string method instead of .lower(), as it is a little more aggressive in removing language-specific case distinctions.
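To make the difference concrete, here is a small sketch using the German word “Straße” (my own choice of example, not something from the preprocessing code above). It only uses built-in string methods:

```python
# str.lower() vs str.casefold() on the German word "Straße" (street).
word = "Straße"

lowered = word.lower()      # keeps the sharp s
folded = word.casefold()    # expands the sharp s to "ss"

# Case-insensitive comparison against the all-caps spelling "STRASSE".
matches_with_lower = lowered == "STRASSE".lower()
matches_with_casefold = folded == "STRASSE".casefold()

print(lowered, folded, matches_with_lower, matches_with_casefold)
```

Keep in mind that the regex in preprocess() keeps only the letters a-z, so it would also need adjusting before it could handle text like this.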

For some applications, you may want to tokenize on different levels. In addition to the word_tokenize function, the nltk.tokenize module has the following popular options available:

sent_tokenize - splits text into sentences
WhitespaceTokenizer - splits text on whitespace characters
RegexpTokenizer - splits text into substrings based on a regex pattern
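To get a feel for how much the tokenization level matters, here is a rough standard-library sketch. These are approximations I wrote for illustration, not NLTK’s implementations; the sample sentence and the regex patterns are my own assumptions:

```python
import re

sample = "NLTK isn't magic. It splits text, e.g. into words!"

# Whitespace splitting keeps punctuation attached to the neighbouring word.
whitespace_tokens = sample.split()

# Keeping only runs of word characters drops punctuation entirely,
# but also breaks "isn't" and "e.g." into pieces.
regex_tokens = re.findall(r"\w+", sample)

# A deliberately naive sentence splitter: break on whitespace that follows
# '.', '!' or '?'. It wrongly splits after the abbreviation "e.g."
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

print(whitespace_tokens)
print(regex_tokens)
print(naive_sentences)
```

The naive sentence splitter trips over the abbreviation, which is the kind of case a trained sentence tokenizer such as sent_tokenize is designed to handle better.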

If your application needs to ignore common variations of each word, you can add stemming or lemmatization.

Here is an example of the popular Porter Stemmer in action:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]

Here are a few examples of non-stemmed and stemmed tokens from this note:

Original token: ‘primer’, ‘natural’, ‘languages’, ‘processing’, ‘process’, ‘corpora’
After stemming: ‘primer’, ‘natur’, ‘languag’, ‘process’, ‘process’, ‘corpora’

Stemming usually uses simple rule-based algorithms to identify and remove common word prefixes and suffixes. While they are computationally efficient, stemming algorithms are less accurate than lemmatization, often producing results that are not actual words. As in the example above, they can also reduce different words to the same stem (as with “processing” and “process”), which can erase distinctions that matter for downstream tasks.
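To show why rule-based stripping produces non-words, here is a toy suffix stripper. It is emphatically not the Porter algorithm: the suffix list, the “ss” guard, and the minimum stem length are all assumptions I picked so that it happens to reproduce the six stems listed above.

```python
# Toy suffix list (an assumption for illustration), tried longest first.
SUFFIXES = sorted(("ational", "ing", "es", "al", "s"), key=len, reverse=True)

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, unless that leaves too short a stem."""
    if word.endswith("ss"):  # leave words like "process" alone
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

examples = ["primer", "natural", "languages", "processing", "process", "corpora"]
stems = [naive_stem(word) for word in examples]
print(stems)
```

A handful of blind string rules is enough to collapse “processing” into “process” and to leave “natur” behind, with no notion of whether the result is a real word.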

Lemmatization achieves similar goals by different means. Lemmatization algorithms involve more sophisticated knowledge-based procedures that heavily rely on vocabularies and morphological analysis to return the base or dictionary form of a word without additional inflections. They are generally more accurate but computationally expensive.

As an example:

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in tokens]

To compare, let’s look at the same examples as above, now before and after lemmatization:

Original token: ‘primer’, ‘natural’, ‘languages’, ‘processing’, ‘process’, ‘corpora’
After lemmatization: ‘primer’, ‘natural’, ‘language’, ‘processing’, ‘process’, ‘corpus’

As you can see, the lemmatization results look much more like real words. In addition, we can still distinguish between words like “process” and “processing”, and it did a great job at standardizing an unusual word like “corpora”. However, to stem all tokens from this article, it took my laptop about 8 ms, while the lemmatization took 920 ms, a hefty difference when you find yourself processing a large amount of data.
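The “knowledge-based” part of lemmatization can be illustrated with a toy lookup table. The table below is a hypothetical miniature vocabulary I made up for this sketch, keyed by both the word and its part of speech; a real lemmatizer consults a full lexical database instead.

```python
# Hypothetical miniature vocabulary: (surface form, part of speech) -> lemma.
TOY_LEMMAS = {
    ("mice", "n"): "mouse",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
    ("saw", "v"): "see",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; unknown words pass through."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("mice"))          # looked up as a noun
print(toy_lemmatize("better"))        # as a noun: no entry, returned unchanged
print(toy_lemmatize("better", "a"))   # as an adjective: mapped to its lemma
```

The part of speech matters in NLTK as well: WordNetLemmatizer.lemmatize() takes an optional pos argument that defaults to noun, so “better” is only mapped to “good” when it is lemmatized as an adjective (pos="a").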

Exploration

Now that we’ve prepared the data, let’s apply some common exploration techniques to get a sense of what we are working with. Before we begin, it is useful to convert our document to NLTK’s Text class, which exposes simple, interactive interfaces to some of the most popular functionality. On the tokenized version of your text, call:

text = nltk.Text(tokens)

Let’s begin by calculating a few simple statistics about our text:

word_count = len(tokens)
print("Word Count:", word_count)

average_word_length = sum(len(word) for word in tokens) / word_count
print("Average Word Length:", average_word_length)

unique_words = set(tokens)
vocabulary_richness = len(unique_words) / word_count
print("Vocabulary Richness:", vocabulary_richness)


>> Word Count: 2061
Average Word Length: 6.396894711305191
Vocabulary Richness: 0.35419699175157693

A lot of data exploration in NLP starts with frequency-based analysis. It often serves as a basis for more complex tasks such as topic modeling, sentiment analysis, and feature extraction. Frequency analysis works on the assumption that the frequency of a word within a document has some relationship to its importance within the document. We should be careful with this assumption, as it may lead us to believe that the word “the” is the most important and meaningful word in the English language. In fact, there is an argument to be made that the least common words often carry the most meaning. However, looking at frequency distributions of words in our documents can be quite satisfying.

We will begin by creating a frequency distribution from our tokenized text:

from nltk.probability import FreqDist
fdist = FreqDist(text)

If we look inside our newly created frequency distribution, we will see that it is essentially a dictionary keyed with the unique words (terms) in our document, and valued with their counts within the document. To display the most frequent words in our text along with their counts:

fdist.most_common(10)

>> [('word', 108), ('text', 50), ('document', 42),
 ('idf', 33), ('nltk', 31), ('token', 25),
 ('python', 24), ('doc', 24), ('frequency', 22),
 ('language', 21)]
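Since FreqDist builds on Python’s collections.Counter, the same idea can be sketched with the standard library alone. The token list below is a made-up sample, not the tokens from this note:

```python
from collections import Counter

# Made-up sample tokens, standing in for the preprocessed document.
sample_tokens = ["word", "text", "word", "token", "text", "word"]

counts = Counter(sample_tokens)

# The two most frequent terms with their raw counts.
top_two = counts.most_common(2)

# Relative frequency of a single term.
relative_freq = counts["word"] / sum(counts.values())

print(top_two, relative_freq)
```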

Sidenote: The frequency distribution above is a precursor, at least conceptually, to the Bag of Words (BOW) technique for feature extraction (to be discussed later).

You can get a better sense of the different contexts in which a specific word appears in the text through the concordance method:

text.concordance("word")

>> """ ing text smaller unit usually word subwords morpheme case tokenization 
 stopword filtering removal frequent word article preposition punctuation deem
mes also includes stemming reduction word base root form often stripping commo
ional comput lemmatization reduction word canonical form lemma using vocabular
mport stopwords nltk tokenize import word tokenize def preprocess text text te"""
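Conceptually, a concordance is a keyword-in-context listing. Here is a minimal sketch that uses a window measured in tokens; this is my own simplification (the output above is clipped by character width instead), and the sample tokens are made up:

```python
def keyword_in_context(tokens, keyword, window=2):
    """Return one string per occurrence of keyword, with up to
    `window` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window): i]
            right = tokens[i + 1: i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines

sample_tokens = ["stopword", "filtering", "removal", "frequent", "word",
                 "article", "preposition", "reduction", "word", "base"]
contexts = keyword_in_context(sample_tokens, "word")
for line in contexts:
    print(line)
```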

If you are interested in the frequency of sequences of co-occurring words, you can look at collocations, with bigrams (two-word collocations) being the most commonly used.

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

bigram_finder = BigramCollocationFinder.from_words(text)
bigrams = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 4)

print("Bigrams:", bigrams)

>> Bigrams: [('natural', 'language'), ('tokenized', 'doc'), ('scikit', 'learn'), ('feature', 'extraction')]

To generate a list of all n-gram combinations, you can use the ngrams utility:

from nltk.util import ngrams

trigrams = list(ngrams(text, 3))
print("Trigrams:", trigrams)
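Under the hood, n-grams are just a sliding window over the token list. Here is a standard-library sketch of that idea, with a made-up token list and a helper name (sliding_ngrams) of my own:

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

sample_tokens = ["natural", "language", "processing",
                 "natural", "language", "toolkit"]

bigram_counts = Counter(sliding_ngrams(sample_tokens, 2))
trigram_list = list(sliding_ngrams(sample_tokens, 3))

print(bigram_counts.most_common(1))
print(trigram_list)
```

Note that this ranks bigrams by raw count only. The collocation finder above scores them with a likelihood ratio instead, which favours pairs that occur together more often than their individual frequencies would suggest.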

Parts of Speech Tagging

Parts of Speech (POS) tagging is the process of labeling each word in a sentence with its corresponding part of speech (such as noun, verb, adjective, etc.). This is a crucial step in many NLP tasks, such as parsing, information extraction, and machine translation.

Here’s a simple example:

from nltk import pos_tag
tagged = pos_tag(text)

NLTK uses the Penn Treebank POS tags. Here are some common tags and their meanings:

CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there (like “there is”)
FW: Foreign word
IN: Preposition/subordinating conjunction
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List marker
MD: Modal (could, will)
NN: Noun, singular
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun (I, me, you)
PRP$: Possessive pronoun (my, your)
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle (give up)
TO: “to” (go “to” the store)
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund/present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person sing. present
VBZ: Verb, 3rd person sing. present
WDT: Wh-determiner (which, that)
WP: Wh-pronoun (who, what)
WP$: Possessive wh-pronoun (whose)
WRB: Wh-adverb (where, when)

Chunking and Chinking

Chunking and chinking are used to identify and segment specific parts of text, typically involving phrases. While POS tagging assigns parts of speech to individual words, chunking groups these tagged words into meaningful chunks, such as noun phrases or verb phrases. Chinking, on the other hand, is the process of removing specific tokens from chunks, refining the chunks further. These techniques are used to identify entities, relationships, and facts in the text, and often serve as precursors for tasks like text summarization.

Chunking, also known as shallow parsing, focuses on grouping words into meaningful units, like noun phrases (NP) or verb phrases (VP). The most common use of chunking is to find non-overlapping chunks in a sentence.

from nltk import RegexpParser

grammar = """
NP: {<DT>?<JJ>*<NN>}
}<VB|IN>+{
"""
chunk_parser = RegexpParser(grammar)
chunked = chunk_parser.parse(tagged)  # POS-tagged tokens from the previous section

print(chunked)

>> (NP large/JJ body/NN)
   (NP text/NN)
   (NP scientific/JJ literature/NN)
   (NP natural/JJ language/NN)...

The example above identifies and groups nouns and the words that modify them into noun phrases. The RegexpParser class takes the grammar as a pseudo-regex pattern and applies it to the POS-tagged tokens to create chunks. The chunking grammar NP: {<DT>?<JJ>*<NN>} defines a noun phrase (NP) as an optional determiner (<DT>?), followed by zero or more adjectives (<JJ>*), and ending with a noun (<NN>). The chinking grammar }<VB|IN>+{ specifies that verbs (<VB>) and prepositions (<IN>) should be excluded from the chunks. Chinking can be useful when the chunking grammar captures too much, and you need to exclude certain words.
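In the grammar above, the chunk pattern only ever admits DT, JJ, and NN tags, so the chink rule has little to carve out. To see chinking actually change the result, here is a sketch on a hand-tagged sentence (so no tagger model is needed), using the common “chunk everything, then chink” pattern. The sentence and tags are a standard textbook-style example rather than output from this note:

```python
from nltk import RegexpParser

# Hand-tagged sentence, so no tagger model or corpus download is needed.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# Chunk every token into one NP, then chink (carve out) runs of
# past-tense verbs and prepositions, splitting the chunk in two.
grammar = r"""
NP:
    {<.*>+}
    }<VBD|IN>+{
"""
tree = RegexpParser(grammar).parse(tagged_sentence)

# Collect the words of each NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```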

Named Entity Recognition

Named Entity Recognition (NER) involves identifying and classifying named entities mentioned in the text into predefined categories such as person names, organizations, locations, dates, and more. NER helps in extracting meaningful information from text, enabling better understanding and analysis. It is important for text summarization, categorization, and extracting structured information from unstructured text data.

NLTK provides limited support for NER using its ne_chunk function, which performs chunking and NER on POS-tagged text. However, to provide the algorithm the right context size, we first need to tokenize our text on a sentence level:

from nltk import ne_chunk, pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(document)
named_sentences = []
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged = pos_tag(words)
    named_entities = ne_chunk(tagged)
    named_sentences.append(named_entities)

named_sentences[5]
>> Tree('S', [('*', 'NN'), ('*', 'NNP'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Toolkit', 'NNP'), ('(', '('), ('[', 'NNP'), ('NLTK', 'NNP'), (']', 'NNP'), ('(', '('), ('https', 'NN'), (':', ':'), ('//www.nltk.org/', 'NN'), (')', ')'), (')', ')'), ('*', 'FW'), ('*', 'FW'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), Tree('PERSON', [('Python', 'NNP')]), ('library', 'NN'), ('designed', 'VBN'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')])

You can see that it identified “Python” as a person. This example illustrates that NLTK’s NER support is somewhat basic, and that more advanced NER tasks are better handled with pre-trained or custom-trained models from additional libraries like spaCy or sklearn-crfsuite.

Feature Extraction

Feature extraction is the process of transforming raw data into numerical features that can be used for machine learning algorithms. In the context of text data, feature extraction involves converting textual data into numerical representations. This section summarizes the variety of tools and techniques for feature extraction that can be achieved with NLTK or adjacent libraries.

Bag of Words

The Bag of Words (BOW) model represents text as a collection of word frequencies. In essence, it is very similar to the frequency distribution we created earlier in the exploration section, but with an added step of converting this distribution into a sparse matrix representation. We can do this quite efficiently using the CountVectorizer class from scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)
bag_of_words_matrix = X.toarray()

This creates a matrix with dimensions (number of words in the document) × (size of the vocabulary). Because we passed in individual tokens, CountVectorizer treats each token as its own one-word document, so each row (representing the position of a word in the document) contains a single 1 in the column corresponding to that word’s entry in the vocabulary. More typically, each row would represent a whole document from a collection. To get the vocabulary in column order:

print(vectorizer.get_feature_names_out())
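For comparison, here is what a document-level bag of words looks like, sketched with the standard library on two made-up sentences. The whitespace tokenization and the alphabetically sorted vocabulary are my own simplifications:

```python
from collections import Counter

# Two made-up documents; each becomes one row of the matrix.
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order (one column each).
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary entry.
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Passing a list of document strings like docs to CountVectorizer.fit_transform() should produce the same kind of one-row-per-document matrix, although its default tokenizer ignores single-character tokens.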

Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure (or a score) used to evaluate the importance of a word in a document relative to a collection or corpus of documents. It is often used in information retrieval, text mining, and keyword extraction as a weighting factor during both the indexing and retrieval phases. TF-IDF helps to adjust for the fact that some words appear more frequently in general and provides a way to normalize the frequency of words so that the more important words are properly weighted.

As in previous examples, let’s imagine that we have a document $ d $ (e.g., this note) from a collection of documents $ D $ (e.g., this blog). We also have some terms of interest $ t $ (e.g., “language” and “data”), and we would like to know which of these terms is more representative of the document $ d $.

As the name suggests, the TF-IDF measure consists of two major components:

Term Frequency (TF) measures the frequency of a word within a document. It is the ratio of the number of times a word appears in a document compared to the total number of words in that document. The idea is that the more often a word appears in a document, the more important it is for the document.

$$ TF(t, d) = \frac{f_{t,d}}{\displaystyle \sum_{t^{\prime} \in d} f_{t^{\prime}, d}} $$

where:

$ f_{t,d} $ is the count of the term $ t $ in a document $ d $
$ \displaystyle \sum_{t^{\prime} \in d} f_{t^{\prime}, d} $ is the total number of terms $ t^{\prime}$ in the document $ d $

Inverse Document Frequency (IDF) measures the importance of the word across a set of documents (corpus). The logic behind IDF is that a word is not very informative if it appears in too many documents. Thus, the IDF of a word increases with the rarity of the word across documents.

$$ IDF(t, D) = \log\left(\frac{N}{1 + |\{d \in D : t \in d\}|}\right) $$

where:

$ N $ is the total number of documents in the corpus $ D $.
$ |\{d \in D : t \in d\}| $ is the number of documents where the term $ t $ appears. It is often adjusted by $ +1 $ to avoid a division by 0 when a term is not found in any documents.

The TF-IDF score is then computed as the product of TF and IDF: $$ TFIDF(t, d, D) = TF(t, d) \times IDF(t, D) $$

The resultant scores represent the relative importance of a term in a specific document out of a collection of documents. A high TF-IDF score indicates a term that is frequent in the given document but rare across the corpus, and therefore potentially more relevant to that particular document.
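As a worked example with made-up numbers, and assuming the natural logarithm (the choice of base is my assumption; it only rescales the scores): suppose the term $ t $ occurs 3 times in a 100-term document $ d $, and appears in 1 of the $ N = 10 $ documents in $ D $. Then:

$$ TF(t, d) = \frac{3}{100} = 0.03, \qquad IDF(t, D) = \ln\left(\frac{10}{1 + 1}\right) = \ln 5 \approx 1.609 $$

$$ TFIDF(t, d, D) \approx 0.03 \times 1.609 \approx 0.048 $$

A term that appears in all 10 documents would instead get $ IDF(t, D) = \ln(10/11) \approx -0.095 $, so with the $ +1 $ adjustment a ubiquitous term ends up with a slightly negative score.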

import math
from collections import defaultdict

# Placeholder corpus: replace with your own list of document strings
documents = ["first document text", "second document text"]

tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

def compute_tf(tokenized_doc):
    tf_dict = defaultdict(int)
    for word in tokenized_doc:
        tf_dict[word] += 1
    tf_dict = {word: count / len(tokenized_doc) for word, count
              in tf_dict.items()}
    return tf_dict

def compute_idf(tokenized_docs):
    idf_dict = defaultdict(int)
    N = len(tokenized_docs)
    all_words = set([word for doc in tokenized_docs for word in doc])
    
    for word in all_words:
        containing_docs = sum(1 for doc in tokenized_docs if word in doc)
        idf_dict[word] = math.log(N / (1 + containing_docs))
    
    return idf_dict

def compute_tf_idf(tf_dict, idf_dict):
    tf_idf_dict = {word: tf_val * idf_dict[word] for word, tf_val 
                  in tf_dict.items()}
    return tf_idf_dict

tf_docs = [compute_tf(doc) for doc in tokenized_docs]
idf_dict = compute_idf(tokenized_docs)
tf_idf_docs = [compute_tf_idf(tf, idf_dict) for tf in tf_docs]

for i, tf_idf in enumerate(tf_idf_docs):
    print(f"Document {i + 1} TF-IDF:")
    print(tf_idf)
    print()

This implementation provides a basic understanding of how TF-IDF works and can be customized or extended for more complex text processing tasks. For the less patient, we can also use a ready-made scikit-learn TfidfVectorizer class to encode the TF-IDF score into a sparse matrix representation to be used as ML features:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # same list of document strings as above
tfidf_matrix = X.toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix)
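Don’t expect the numbers to match the hand-rolled version exactly. As far as I can tell from the scikit-learn documentation, TfidfVectorizer by default uses raw term counts for TF, a smoothed IDF, and L2-normalizes each row. The sketch below compares only the two IDF variants; the function names are my own:

```python
import math

def textbook_idf(n_docs, doc_freq):
    """IDF as defined in this note: log(N / (1 + df))."""
    return math.log(n_docs / (1 + doc_freq))

def smoothed_idf(n_docs, doc_freq):
    """Smoothed variant documented for TfidfVectorizer's default
    smooth_idf=True: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4
for doc_freq in (1, 4):
    print(doc_freq,
          round(textbook_idf(n_docs, doc_freq), 3),
          round(smoothed_idf(n_docs, doc_freq), 3))
```

With the smoothed variant, a term that appears in every document still gets a weight of 1 rather than a negative value, so it is down-weighted but never cancelled out.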

Wrap-Up

In conclusion, Natural Language Processing (NLP) is a transformative technology that enhances the way we interact with and understand vast amounts of textual data. By enabling computers to interpret, generate, and derive meaning from human language, NLP bridges the gap between human communication and machine intelligence. This capability is crucial across various domains, from improving search engines and translation services to automating customer support and enhancing data-driven decision-making processes. As the volume of unstructured text data continues to grow, the importance of NLP will only increase, driving innovations and efficiencies in numerous industries. Embracing NLP techniques and tools empowers us to unlock deeper insights and create more intelligent, responsive systems that can adapt to the complexities of human language.