This is Lecture 5 from my Natural Language Processing course.

Transitioning to Word Meaning

So far:

  • n-grams predict the next word from a sequence of tokens
  • Naive Bayes classifies texts using word counts/probabilities
  • POS/NER assign symbolic categories to words

Limitations:

  • Model knows dog and cat, but not that they’re similar
  • No notion of synonymy, antonymy, or relatedness

We want our models to understand that:

  • buy, sell, pay are related through the events they describe (semantic frames and roles)
  • happy and sad are opposites, with different connotations (antonymy)
  • coffee and cup frequently co-occur in real life (relatedness)

Vector Semantics

Vector semantics defines a word’s meaning by its linguistic distribution: we look at its neighbouring words or grammatical environments.

Foundations

  • Words are represented as points (vectors) in some multi-dimensional space
  • Word vectors are generally called embeddings
  • Semantically similar words are mapped to nearby points, that is, they “are embedded near each other”
  • A desirable property: proximity in the vector space reflects semantic similarity/relatedness

Words as Vectors

Term-document matrix

Each document is represented as a count vector (a column) and each word is also a vector (a row). A word vector has one dimension per document in the collection; a document vector has one dimension per word in the vocabulary.
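As a rough illustration, here is a minimal Python sketch that builds a term-document count matrix from a toy corpus (the document names and texts are invented for illustration):

    from collections import Counter

    # Toy corpus: document names and texts are invented for illustration
    docs = {
        "doc1": "the cat sat on the mat",
        "doc2": "the dog chased the cat",
        "doc3": "stocks fell as the market closed",
    }

    vocab = sorted({w for text in docs.values() for w in text.split()})
    counts = {name: Counter(text.split()) for name, text in docs.items()}

    # Rows are words, columns are documents, cells are raw counts
    matrix = [[counts[name][word] for name in docs] for word in vocab]

    # A word vector is a row (one dimension per document)
    print(matrix[vocab.index("cat")])   # [1, 1, 0]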

Term-term matrix

In this case words are both rows and columns. Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The dimensionality is $|V| \times |V|$, where $|V|$ is the vocabulary size.
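Similarly, a minimal sketch of a term-term co-occurrence matrix with a symmetric window of two words on each side (the toy corpus and the window size are assumptions for illustration):

    from collections import defaultdict

    corpus = ["the cat sat on the mat", "the dog sat on the rug"]   # toy corpus
    window = 2                                                      # +/- 2 word context window

    cooc = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    cooc[target][tokens[j]] += 1

    # The row for "sat" is its (sparse) |V|-dimensional word vector
    print(dict(cooc["sat"]))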

Cosine for measuring similarity

How do we calculate the similarity (or distance) between word vectors?

Cosine similarity measures the similarity in the direction or orientation of the vectors, ignoring differences in their magnitude or scale.
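For reference, cosine similarity is the dot product of the two vectors normalised by their lengths:

$\mathrm{cosine}(\mathbf{v}, \mathbf{w}) = \dfrac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \dfrac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

Since count-based vectors are non-negative, their cosine ranges from 0 (no shared dimensions) to 1 (identical direction).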

TF-IDF: Weighting terms in the vector

Raw frequency alone is not a reliable measure of association between words, as it can be skewed and lacks discrimination power.

We need to balance two important constraints:

  • Words that frequently co-occur within a given context are more significant than those that only appear a few times.
  • Co-occurrences with highly frequent words are less informative and should be down-weighted accordingly.

TF-IDF weighting

  • Term frequency: $\mathrm{tf}_{t,d} = \mathrm{count}(t,d)$ or $\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t,d) + 1)$
  • Inverse document frequency: $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$, where $N$ is the total number of documents, and $\mathrm{df}_t$ is the number of documents the term $t$ occurs in.
  • TF-IDF weight: $w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
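A minimal numpy sketch of these three quantities over a toy term-document count matrix (the counts are made up for illustration):

    import numpy as np

    # Toy term-document count matrix: rows = terms, columns = documents (made-up counts)
    counts = np.array([
        [3, 0, 1],
        [0, 7, 0],
        [2, 1, 4],
    ])
    N = counts.shape[1]                 # total number of documents
    df = (counts > 0).sum(axis=1)       # number of documents each term occurs in
    idf = np.log10(N / df)              # inverse document frequency
    tf = np.log10(counts + 1)           # log-scaled term frequency
    tfidf = tf * idf[:, None]           # w = tf * idf
    print(tfidf.round(3))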

Long, sparse vectors do not model synonymy well. For example, car and automobile are synonyms, yet they are distinct dimensions: a word that has car as a neighbour and a word that has automobile as a neighbour should be similar, but their sparse vectors may not show it.

For this reason, we turn to short and dense vectors.

Short and Dense Vectors

Latent Semantic Analysis (LSA)

Global co-occurrence count-based method

  • Based on statistics of how often some word co-occurs with its neighbour words in a large text corpus
  • Dimensionality reduction methods (SVD or Random Projection)

After the decomposition, we keep the truncated matrix T (only the top d dimensions) as the word embeddings
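A minimal sketch of this pipeline with numpy’s SVD, assuming a co-occurrence matrix X and an embedding dimensionality d chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(500, 500)).astype(float)   # stand-in co-occurrence matrix
    d = 50                                                 # embedding dimensionality (assumption)

    # Full SVD: X = W @ diag(S) @ C, singular values sorted in decreasing order
    W, S, C = np.linalg.svd(X, full_matrices=False)

    # Truncate to the top d dimensions; each row of T is a dense word embedding
    T = W[:, :d] * S[:d]                                   # optionally scaled by the singular values
    print(T.shape)                                         # (500, 50)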

Using the dense vectors for similarity computations:

  • Typically gives better results (filter out noise)
  • Faster

Word2Vec

Local context predictive method

  • Instead of counting, train a classifier on a binary prediction task: is word c likely to show up near target word t?
  • a simple task: binary classification instead of word prediction
  • a simple architecture: logistic regression instead of multilayer neural network
  • no need for human labels

Word2Vec = Skip-gram (context representation: better coverage of rare words) + Negative sampling (training method: more intuitive)

The Skip-Gram classifier

  • Train a probabilistic classifier that, given
    • a test target word t,
    • its context window of L words $c_{1:L}$,

assigns a probability based on how similar this context window is to the target word

  • This classifier gives a reasonably high probability estimate to all words that occur in the context
    • Lower probabilities to noise words (negative examples)

Intuition

  1. Treat the target word and a neighbouring context word as positive examples.
  2. Randomly sample other words in the lexicon to get negative samples.
  3. Use logistic regression to train a classifier to distinguish those two cases.
  4. Use the regression weights as the embeddings.
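A minimal sketch of steps 1 and 2, generating positive (target, context) pairs from a window and drawing random negatives from the lexicon; the toy sentence, window size, and number of negatives k are assumptions:

    import random

    sentence = "a tablespoon of apricot jam".split()          # toy sentence (assumption)
    lexicon = sentence + ["tolstoy", "boil", "forever"]        # toy lexicon (assumption)
    window, k = 2, 2                                           # context window and negatives per positive

    positives, negatives = [], []
    for i, target in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j == i:
                continue
            positives.append((target, sentence[j]))            # real (target, context) pair
            for _ in range(k):
                # Simplified: real implementations sample from a weighted unigram
                # distribution (frequency^0.75) and avoid true context words.
                negatives.append((target, random.choice(lexicon)))

    print(positives[:3])
    print(negatives[:3])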

The goal is to train a classifier such that, given a $(t, c)$ pair, it assigns the probabilities:

  • $P(+ \mid t, c)$: the probability that c is a real context word of t
  • $P(- \mid t, c) = 1 - P(+ \mid t, c)$: the probability that c is not a real context word of t

The probability for one context word:

$P(+ \mid t, c) = \sigma(c \cdot t) = \dfrac{1}{1 + \exp(-c \cdot t)}$

Assuming all L context words are independent:

$P(+ \mid t, c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot t)$

The sigmoid function $\sigma$ turns the dot products into probabilities.
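A minimal numerical sketch of these two equations, with randomly initialised placeholder embeddings:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    d, L = 8, 4                              # embedding size and window size (assumptions)
    t = rng.normal(size=d)                   # target word embedding
    contexts = rng.normal(size=(L, d))       # one embedding per context word

    p_each = sigmoid(contexts @ t)           # P(+ | t, c_i) for each context word
    p_window = p_each.prod()                 # product over the window (independence assumption)
    print(p_each.round(3), p_window.round(4))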

Skip-gram Loss function

Consider one positive example $(t, c_{pos})$ together with its $k$ noise words $c_{neg_1}, \ldots, c_{neg_k}$; the learning objective is to minimise this loss function:

$L_{CE} = -\left[ \log \sigma(c_{pos} \cdot t) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot t) \right]$

One step of gradient descent:

  1. Shift the embedding of the target word apricot towards that of the real context word jam
  2. Shift the embedding of the target word apricot away from that of the noise word Tolstoy
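A minimal numpy sketch of this single update, using placeholder embeddings for apricot (target), jam (positive context), and Tolstoy (noise word); the learning rate and dimensionality are assumptions, and the gradients follow the loss above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    d, eta = 8, 0.1                              # embedding size and learning rate (assumptions)
    t = rng.normal(size=d)                       # target embedding ("apricot")
    c_pos = rng.normal(size=d)                   # real context embedding ("jam")
    c_neg = rng.normal(size=d)                   # noise word embedding ("Tolstoy")

    # Gradients of L = -[log sigmoid(c_pos.t) + log sigmoid(-c_neg.t)]
    g_pos = (sigmoid(c_pos @ t) - 1) * t         # pulls c_pos towards t
    g_neg = sigmoid(c_neg @ t) * t               # pushes c_neg away from t
    g_t = (sigmoid(c_pos @ t) - 1) * c_pos + sigmoid(c_neg @ t) * c_neg

    # One gradient descent step
    c_pos -= eta * g_pos
    c_neg -= eta * g_neg
    t -= eta * g_t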

As a result:

  • Words that share many contexts get close to each other
  • Contexts that share many words get close to each other
  • Represent each target word as a d-dimensional vector
  • Represent each context word as a d-dimensional vector

Bias in word embeddings

Word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases of the text used to train the model.

Here comes the riddle

A man and his son get into a terrible car crash. The father dies, and the boy is badly injured. In the hospital, the surgeon looks at the patient and exclaims, “I cannot operate on this boy, he is my son!”

The point of the riddle is that we tend to associate the profession of surgeon with men. The surgeon in this case is the boy’s mother.