Artificial Neural Networks: Embedding Models

Before we can process language-based data using Artificial Neural Networks (ANNs), we need to convert it into a numerical representation. Embedding models are designed for this purpose. They transform language data into dense high-dimensional vectors that preserve the semantic associations between words, capturing the essence of language data in a form that computers can process. This note explores the most popular embedding model architectures, looks into how these models are trained, and discusses their critical role in Natural Language Processing (NLP).

This note is part of a series on Artificial Neural Networks:

NLP Concepts

This note requires some understanding of basic NLP concepts. Before proceeding, make sure that you are well primed on NLP basics.

Language as Numbers

Before the advent of embedding models, most NLP routines achieved text-to-numerical conversion using local representations, where each entity is represented as an isolated unique identifier. This is generally achieved by one-hot encoding, where a word is represented by a vector with the size of the entire vocabulary (which can range from tens of thousands to millions of dimensions). All values in such a vector are equal to 0, except for a single value equal to 1 at the index that uniquely represents the encoded word. One-hot encoded vectors, while simple and intuitive, come with several disadvantages. First, these representations rely on vectors with an extremely high dimensionality, especially for languages with large vocabularies. This can lead to memory inefficiencies and computational challenges that stem from the “curse of dimensionality”. Second, one-hot encoding does not capture any semantic or syntactic relationships between words, meaning that each word is treated as an entirely independent entity. This lack of contextual information prevents the model from understanding nuances, such as similarities between words or their meanings in different contexts. Embedding models are designed to address these issues by forgoing sparse-vector representations and instead producing dense-vector distributed representations.
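To make the second disadvantage concrete, here is a minimal sketch of one-hot encoding in NumPy. The four-word vocabulary and the `one_hot` helper are made up for illustration; the point is only that any two distinct one-hot vectors are orthogonal, so they carry no information about how related the words are.

```python
import numpy as np

# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["cat", "dog", "house", "purple"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vocabulary-sized vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")

print(cat)        # [1. 0. 0. 0.]
# Distinct one-hot vectors are always orthogonal, so their dot product
# (and therefore their cosine similarity) is 0 for every pair of words.
print(cat @ dog)  # 0.0
```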

Projections on a Manifold

Embedding models operate on the distributional hypothesis, which states that words with similar meanings tend to appear in similar contexts. Embedding models project data in the form of dense vectors onto a high-dimensional manifold, usually containing hundreds to thousands of dimensions. Each of these dimensions represents some semantic quality of a word. As a highly oversimplified example, one dimension can reflect how “purple” a word might be, while another can reflect how much a word relates to the concept of a “house”. In other words, each dimension represents a latent variable that combines specific features of the data into a quantifiable representation of a specific relationship. In this sense, embeddings achieve a distributed representation of words across multiple latent dimensions. Since embedding models are trained to map words based on their contextual relationships (i.e., how often certain words appear in similar contexts), semantically similar words end up with a shorter distance between their vectors in the embedding space. Thus, vector similarity measures (e.g., cosine similarity) or distance metrics can be used to measure and query semantic relationships between words.
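As a rough illustration of querying semantic relationships, the sketch below computes cosine similarity between a few embeddings. The three-dimensional vectors are invented for this example (real embeddings are learned and much larger), and `most_similar` is a hypothetical helper, not part of any library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, hand-picked so that related words point
# in similar directions.
embeddings = {
    "cottage": np.array([0.9, 0.1, 0.3]),
    "house": np.array([0.8, 0.2, 0.4]),
    "violet": np.array([0.1, 0.9, 0.2]),
}

def most_similar(query, embeddings):
    """Return the other word whose vector is closest in angle to the query."""
    others = [word for word in embeddings if word != query]
    return max(
        others,
        key=lambda word: cosine_similarity(embeddings[query], embeddings[word]),
    )

print(most_similar("house", embeddings))  # cottage
```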

Data Preparation

Embedding models are trained on large corpora of text: large enough to represent the variety of contexts that each word can exhibit. Before model training can begin, the training data needs to be cleaned up and preprocessed. The preparation generally follows the usual NLP data preprocessing steps. It should be noted, however, that the removal of stopwords is somewhat of a contentious topic in embedding model training. In some applications, removing stopwords can improve performance, but we have to be careful about which words we treat as stopwords. Words that carry semantic meaning, like negations, should not be filtered. Sub-sampling of very frequent words is a much more common approach to reducing noise in the dataset for this application.
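One way sub-sampling can be done is sketched below. It assumes the discard probability described in the original Word2Vec paper, 1 - sqrt(t / f(w)), where f(w) is the relative frequency of a word and t is a threshold. The `subsample` function and the default threshold are illustrative choices, not a reference implementation.

```python
import random
from collections import Counter

def subsample(tokens, threshold=1e-3, seed=0):
    """Randomly drop occurrences of very frequent tokens.

    Each occurrence of a word with relative frequency f is discarded
    with probability 1 - sqrt(threshold / f), clipped at 0 so that
    words rarer than the threshold are always kept.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for token in tokens:
        frequency = counts[token] / total
        p_discard = max(0.0, 1.0 - (threshold / frequency) ** 0.5)
        if rng.random() >= p_discard:
            kept.append(token)
    return kept

# Toy corpus: one very frequent word and one rare word.
tokens = ["the"] * 990 + ["not"] * 10
kept = subsample(tokens, threshold=0.01)
print(kept.count("the"), kept.count("not"))
```

Unlike a fixed stopword list, this keeps some occurrences of every word, so frequent but meaningful words are thinned out rather than removed.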

Models

Word2Vec

Word2Vec is a family of efficient embedding models that can be trained using one of two architectures:

- Continuous Bag-of-Words (CBOW) predicts a target word based on a set of context words. Just like in the standard BOW model, the order of words is ignored.
- Skip-Gram works in the opposite manner by predicting the most probable context words given a single target word. Unlike in CBOW, context words that are closer to the target are weighted more heavily. Skip-Gram is more effective at capturing detailed relationships from smaller datasets, but it is more computationally expensive.

The structure of a basic Word2Vec model is surprisingly simple: an input layer is followed by a single linear “projection” hidden layer, which is then followed by a softmax output layer.

flowchart LR

inputs["Inputs \n size: 1 × V \n one or multiple vectors"]
embeddingM[["Embedding Matrix \n size: V × N"]]
projection["Projection Layer \n size: 1 × N"]
unembeddingM[["Un-embedding Matrix \n size: N × V"]]
output["Output Vector \n size: 1 × V"]

inputs --> embeddingM --> projection --> unembeddingM --"softmax"--> output

Let’s first look at the CBOW architecture in more detail. At the input we start with a set of context words, each encoded as a one-hot vector of size V (the size of the vocabulary). Each context word vector is multiplied by a weight matrix of size V × N, where N is the desired dimensionality of the embedding representation (usually 100-300). This “embedding matrix” is the most important piece of the model: once it is trained, its rows hold the embedded representations of each word in the vocabulary. Multiplying a one-hot encoded word vector by this matrix simply selects the row that corresponds to the word, which makes the matrix an embedding lookup table. During training, the same embedding matrix is shared between all of the input context words, and the resulting row vectors are averaged into a single vector that serves as their composite representation. This averaging is the only operation performed in the “projection” layer; no non-linear activation function is applied. Dropping the non-linear hidden layer is one of the defining innovations of the Word2Vec models, and it significantly reduces their computational complexity.

For model training purposes, the projection vector is then multiplied by an “un-embedding” weight matrix of size N × V, which maps it back into the dimensionality of the vocabulary. The resulting vector of scores is converted into the final output probability distribution using softmax. Training then typically follows the standard backpropagation approach with stochastic gradient descent optimization.
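The CBOW forward pass described above can be sketched in a few lines of NumPy. The sizes, the random initialization, and the names `embedding`, `unembedding` and `cbow_forward` are assumptions made for this example; the weights are untrained, so the output probabilities are meaningless and only the shapes and operations matter.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4  # toy vocabulary size and embedding dimensionality

embedding = rng.normal(scale=0.1, size=(V, N))    # embedding matrix, V x N
unembedding = rng.normal(scale=0.1, size=(N, V))  # un-embedding matrix, N x V

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary."""
    # Selecting rows by index is equivalent to multiplying each one-hot
    # context vector by the embedding matrix.
    projection = embedding[context_ids].mean(axis=0)  # shape (N,)
    scores = projection @ unembedding                 # shape (V,)
    return softmax(scores)

# Check the lookup-table claim for word index 3.
one_hot = np.zeros(V)
one_hot[3] = 1.0
assert np.allclose(one_hot @ embedding, embedding[3])

probs = cbow_forward([1, 2, 4, 5])  # indices of four context words
print(probs.shape, probs.sum())
```

In practice, implementations skip the one-hot multiplication entirely and index into the matrix, which is why the embedding matrix is usually described as a lookup table.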

The skip-gram model works in a similar way, but with a few key differences. At the input, we start with a single one-hot encoded target word vector. As in the CBOW case, the input is multiplied by the embedding matrix, but since there is only one input, there is no need to perform any averaging in the projection layer. Again, an “un-embedding” matrix is used to convert the representation back into the vocabulary dimensionality, and softmax is used to predict a single context word. In training, one target word is forward propagated, and then multiple context words are used to update the weight matrices by backpropagation, one at a time. This, in a sense, achieves the same kind of composite representation of the multiple context words as was achieved by averaging in CBOW. However, in this scheme we can give proximal words a higher weight by randomly sampling the window size. For example, if the maximum context window is of size C, the effective window size for each training example is chosen at random between 1 and C. This means that proximal context words get sampled more often than distant ones, and thus have a higher weight in the composite representation.
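The random window trick is easiest to see in code. The sketch below generates (target, context) training pairs from a sequence of token ids; `skipgram_pairs` and its parameters are hypothetical names chosen for this example.

```python
import random

def skipgram_pairs(token_ids, max_window, seed=0):
    """Generate (target, context) pairs with a randomly sized window.

    For every target position the window size is drawn uniformly from
    1..max_window, so neighbours at distance 1 are always included while
    more distant neighbours are included only some of the time.
    """
    rng = random.Random(seed)
    pairs = []
    for position, target in enumerate(token_ids):
        window = rng.randint(1, max_window)  # both ends inclusive
        start = max(0, position - window)
        stop = min(len(token_ids), position + window + 1)
        for context_position in range(start, stop):
            if context_position != position:
                pairs.append((target, token_ids[context_position]))
    return pairs

# Token ids 0..19 stand in for a 20-word sentence.
pairs = skipgram_pairs(list(range(20)), max_window=3)
near = sum(1 for target, context in pairs if abs(target - context) == 1)
far = sum(1 for target, context in pairs if abs(target - context) == 3)
print(near, far)  # distance-1 pairs are at least as common as distance-3 pairs
```

Each pair then becomes one training step: the target is forward propagated and the context word serves as the label for the softmax output.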

GloVe

The GloVe (Global Vectors for Word Representation) model works by constructing word embeddings through the analysis of word co-occurrence statistics across a large corpus of text. GloVe captures the meaning of a word by leveraging global statistical information about word pairs. To achieve this, GloVe creates a word-word co-occurrence matrix, where each entry represents the number of times a word pair appears together within a specific context window (entries with zero co-occurrences are omitted). While somewhat computationally expensive, this co-occurrence matrix can be generated in a single pass over the corpus. The efficiency of the model is recovered in training, since the number of non-zero co-occurrence entries is generally smaller than the total number of words in the corpus.
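A minimal sketch of building such a co-occurrence table in a single pass is shown below. It stores only non-zero entries in a dictionary and uses plain symmetric counts within the window; the `cooccurrence_counts` function and the toy sentence are assumptions for illustration, and the sketch ignores refinements such as weighting pairs by their distance.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within the window.

    Only pairs that actually co-occur get an entry, so the table stays
    sparse. Counts are symmetric: (a, b) and (b, a) are updated together.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look only backwards; the symmetric update covers the other side.
        for j in range(max(0, i - window), i):
            counts[(word, tokens[j])] += 1.0
            counts[(tokens[j], word)] += 1.0
    return dict(counts)

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("sat", "the")])     # 2.0
print(("cat", "mat") in counts)   # False
```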

Once the co-occurrence matrix is built, GloVe optimizes a cost function to factorize this matrix, effectively transforming the sparse co-occurrence data into dense word vectors. The optimization process aims to find word vectors such that the dot product of two word vectors approximates the logarithm of their co-occurrence count. Specifically, the cost function penalizes the squared difference between the dot product of the word vectors and the logarithm of the corresponding co-occurrence count, with a weighting function that down-weights rare, noisy co-occurrences and caps the influence of very frequent ones.
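The cost function can be sketched as a weighted least-squares sum over the non-zero co-occurrence entries. The bias terms and the weighting function with `x_max = 100` and `alpha = 0.75` follow the GloVe paper rather than this note, and the function names and toy data are made up for the example.

```python
import numpy as np

def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weight that grows with the count and is capped at 1 for frequent pairs."""
    return min(1.0, (count / x_max) ** alpha)

def glove_loss(counts, word_vecs, context_vecs, word_bias, context_bias):
    """Weighted squared error between dot products and log counts.

    counts maps (word index, context index) pairs to non-zero counts.
    """
    loss = 0.0
    for (i, j), count in counts.items():
        prediction = word_vecs[i] @ context_vecs[j] + word_bias[i] + context_bias[j]
        loss += glove_weight(count) * (prediction - np.log(count)) ** 2
    return float(loss)

# Toy setup: 3 words, 2-dimensional vectors, everything initialized to zero.
counts = {(0, 1): 100.0, (1, 2): 1.0}
word_vecs = np.zeros((3, 2))
context_vecs = np.zeros((3, 2))
word_bias = np.zeros(3)
context_bias = np.zeros(3)

# With all-zero parameters only the (0, 1) pair contributes, because
# log(1) = 0 already matches the zero prediction for the (1, 2) pair.
print(glove_loss(counts, word_vecs, context_vecs, word_bias, context_bias))
```

Training then amounts to adjusting the vectors and biases with a gradient-based optimizer until this loss stops decreasing.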

During training, the GloVe model iteratively adjusts the word vectors to minimize the cost function, resulting in word embeddings that capture both syntactic and semantic relationships. Words that frequently co-occur with similar words are positioned close to each other in the vector space, while words with different contexts are placed farther apart. This approach allows GloVe embeddings to effectively represent complex linguistic patterns and relationships, providing a powerful tool for various natural language processing tasks.

Compared to Word2Vec, GloVe is more computationally expensive, but it can capture more accurate contextual information by leveraging the overall statistical properties of the corpus. While Word2Vec is very efficient at capturing local context within sentences, it often struggles to capture the global context of the corpus.

ELMo

Word2Vec and GloVe models provide static embeddings which use a single vector representation for each word regardless of its context. ELMo (Embeddings from Language Models), on the other hand, generates dynamic word embeddings that vary depending on the word’s context within a sentence, effectively capturing polysemy and disambiguating word meanings based on their usage. This is achieved through a multi-layer bidirectional LSTM (Long Short-Term Memory) network. The model is pre-trained on a large corpus of text, where it learns to predict the next word in a sequence (forward direction) and the previous word in a sequence (backward direction). Let’s look at each step in detail:

1. ELMo does not use one-hot word encoding at the input. Instead, it converts the characters of each input word into a dense vector representation using a character-level Convolutional Neural Network (CNN). This character-based representation helps in handling out-of-vocabulary words (including misspellings) and capturing morphological information. This character-level CNN is quite interesting on its own, so let’s briefly look at its structure:
   - At the input, one-hot encoded character vectors are converted into embeddings using a simple character-embedding layer that is very reminiscent of the Word2Vec structure.
   - The convolutional layer then applies a series of filters (kernels) to the sequence of character embeddings. Each filter slides over the sequence, performing element-wise multiplications and summing the results to produce a feature map. These filters are trained to capture local patterns in the character sequence, such as prefixes, suffixes, and other morphological structures. The convolution layer is typically followed by a non-linear activation function like ReLU.
   - A pooling layer, such as max pooling, is often applied after the convolutional layer. Pooling reduces the dimensionality of the feature map by selecting the maximum value within a specified window, effectively summarizing the presence of a feature in a particular region of the input.
   - The pooled feature maps are then flattened and passed through one or more fully connected layers to produce the final character-level representation of the word. This representation captures the morphological and structural information of the word, which can be combined with other word-level features in subsequent layers.
2. The core of ELMo consists of multiple layers of bidirectional LSTM networks. These layers process the sequence of input word vector representations in two directions: forward (left to right) and backward (right to left). The forward pass in each bidirectional layer is meant to encapsulate the preceding context of the target word, while the backward pass captures its following context. Each LSTM layer generates hidden states for each word in the sequence, capturing contextual information from both directions.
3. The final word representation is obtained by applying a learned weighted sum (scalar mix) of the hidden states from all LSTM layers. These weights are task-specific and are fine-tuned during the training of downstream tasks. The scalar mix allows the model to dynamically adjust the importance of different layers based on the specific task.
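Two of the steps above are compact enough to sketch in NumPy: the character-level CNN and the scalar mix. This is a simplified illustration with untrained random weights, not ELMo's actual implementation. All sizes and names are assumptions, characters are mapped to ids with `ord` for convenience, max pooling is taken over the whole word, a single dense layer stands in for the fully connected stage, and the softmax normalization and `gamma` scale in the scalar mix follow the ELMo paper rather than this note.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CHARS, CHAR_DIM = 128, 8  # character table size, character embedding size
NUM_FILTERS, KERNEL = 16, 3   # number of filters and their width in characters
WORD_DIM = 12                 # size of the resulting word representation

char_embedding = rng.normal(scale=0.1, size=(NUM_CHARS, CHAR_DIM))
filters = rng.normal(scale=0.1, size=(NUM_FILTERS, KERNEL, CHAR_DIM))
dense = rng.normal(scale=0.1, size=(NUM_FILTERS, WORD_DIM))

def char_cnn(word):
    """Character embedding -> convolution -> ReLU -> max pooling -> dense."""
    ids = [min(ord(ch), NUM_CHARS - 1) for ch in word]
    ids += [0] * max(0, KERNEL - len(ids))  # pad very short words
    chars = char_embedding[ids]             # (length, CHAR_DIM)
    # Slide a window of KERNEL characters over the word.
    windows = np.stack(
        [chars[i:i + KERNEL] for i in range(len(ids) - KERNEL + 1)]
    )                                       # (positions, KERNEL, CHAR_DIM)
    feature_maps = np.einsum("wkd,fkd->wf", windows, filters)
    activated = np.maximum(feature_maps, 0.0)  # ReLU
    pooled = activated.max(axis=0)             # max over positions, (NUM_FILTERS,)
    return pooled @ dense                      # (WORD_DIM,)

def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Softmax-weighted sum of per-layer hidden states, scaled by gamma.

    layer_states has shape (layers, sequence length, hidden size).
    """
    logits = np.asarray(layer_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return gamma * np.tensordot(weights, layer_states, axes=1)

word_vector = char_cnn("embedding")
# Stand-in hidden states for two layers, five words, four hidden units.
states = np.stack([np.ones((5, 4)), 3.0 * np.ones((5, 4))])
mixed = scalar_mix(states, layer_logits=[0.0, 0.0])
print(word_vector.shape, mixed.shape)
```

Because the word vector is built from characters, a misspelled or unseen word still gets a representation, and in a trained model the mix weights would be learned per downstream task.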

Wrap-up

Using embeddings can often feel like magic. They transform abstract linguistic concepts into tangible numerical forms that a computer can work with, revealing the hidden relationships and meanings in language data. It’s fascinating to see how a mathematical model can understand and capture the essence of human language, bringing words to life in a way that is both intuitive and powerful. Just the simple fact that embedding models work says a lot about human language and how we use it to communicate information.

You will notice that this article has a glaring omission in describing embeddings within transformer models. We will delve into this fascinating subject in a subsequent note, exploring how these modern architectures further enhance the capabilities of NLP systems.
