Parts of Speech

Parts of speech are a way to divide words into categories: verbs are actions (“running”, “eating”, “thinking”, …), nouns are stuff (things, people, abstract concepts such as “eternity”), adjectives are qualities (“good”, “tall”, “white”, “loyal”), etc.

So the question is: How can the computer automatically annotate the PoS tags for a sentence?

Markov Chains

Probability of starting from rain: 0.4
Probability of starting from snow: 0.3
Probability of starting from sunny: 0.3

Probability of the sequence rain snow sunny rain rain?

$0.4 \cdot 0.1 \cdot 0.4 \cdot 0.3 \cdot 0.45 = 0, 00216$

Background

Markov Assumption: $P (q_{i} = a ∣ q_{1} \dots q_{i - 1}) = P (q_{i} = a ∣ q_{i - 1})$

$P (rain ∣ snow sunshine rain snow snow) = P (rain ∣ snow)$

This expresses that the probability of the the next state is dependent only on the previous state, not on the entire history.

Let:

$Q = q_{1}, q_{2}, q_{3} ... q_{n}$ be a set of $N$ states
$A = a_{11}, a_{12} ... a_{n 1} ... a_{nn}$ be a transition probability matrix with each $a_{ij}$ representing the probability of moving from state $i$ to state $j$ . $\sum_{j = 1}^{n} a_{ij} = 1$
$π = π_{1}, π_{2}, ... π_{n}$ be an initial probability distribution over states. $π_{i}$ is the probability that the Markov Chain will start from state $i$ . Also, $\sum_{i = 1}^{n} π_{i} = 1$

What if we are in a room with no windows and we cannot look at the weather outside, but we can feel the temperature?

Hidden Markov Models

tag_{1}^{n} = ar g tag_{1}^{n} max P (word_{1}^{n} ∣ tag_{1}^{n}) P (tag_{1}^{n})

This expression finds the most likely tag sequence $tag_{1}^{n}$ given a word sequence $word_{1}^{n}$ .

P of a tag is only dependent on previous tag: $P (tag_{1}^{n}) \approx \prod P (tag_{i} ∣ tag_{i - 1})$

P of seeing a word is only dependent its PoS tag, not on previous words or PoS tags: $P (word_{1}^{n} ∣ tag_{1}^{n}) \approx \prod P (word_{i} ∣ tag_{i})$

$C (NN, dog)$ means “the number of times the word dog was tagged as NN (noun)” in the corpus.
$C (DT, NN)$ means “the number of times the tag NN follows the tag DT (determiner)” in the corpus.
$C (NN)$ means “the total number of times the tag NN appears” in the corpus.

For the first, we take the initial probability of $D T = 0.5$ and $P (tomorrow ∣ DT) = 0.1$ . For the second, we see on the graph that $P (NN ∣ D T) = 0.8$ and $P (i s ∣ NN) = 0$ and so on.

Viterbi Algorithm

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states in a Hidden Markov Model (HMM).

Tag	Obs1 (tomorrow)	Obs2 (is)	Obs3 (another)	Obs4 (day)
NN	$0.16$	0.0048	0.00384	$0.000245$
VB	0	$0.096$	0.0015	0.000037
DT	0.05	0.0016	$0.0082$	0.000101

To compute the first column:

$P (NN ∣ initial probs) \cdot P (tomorrow ∣ NN) = 0.4 \cdot 0.4 = 0.16$
$P (V B ∣ initial probs) \cdot P (tomorrow ∣ V B) = 0.1 \cdot 0 = 0$
$P (D T ∣ initial probs) \cdot P (tomorrow ∣ D T) = 0.5 \cdot 0.1 = 0.05$

To compute the following columns:

We don’t assume the previous tag.

We calculate the probability of each possible path from the previous column

We pick the one with the highest score → this is the Viterbi step

To compute the first value on the second column:

Previous Tag	$P (prev tag ∣ co l u mn)$	$P (curr tag \to NN)$	$P (i s ∣ NN)$	P(“is”
NN	0.16	0.3	0.1	0.16 × 0.3 × 0.1 = 0.0048
VB	0	0.4	0.1	0 × 0.4 × 0.1 = 0
DT	0.05	0.8	0.1	0.05 × 0.8 × 0.1 = 0.004

How to decide if we’re good at PoS tagging?

We need a measure of performance.

Ideas:

Use techniques to deal with words that did not appear in the training corpus
Use techniques that take into account the next words too
Improve the annotation

Named Entity Recognition (NER)

NER is more useful than POS tagging in many tasks (sentiment analysis towards a company or person, question answering, information extraction, …)
However, it is a harder task than POS tagging, as in POS tagging, entities can be ambiguous:
- [PER Washington] was born into slavery on the farm of James Burroughs.
- [ORG Washington] went up 2 games to 1 in the four-game series.
- Blair arrived in [LOC Washington] for what may well be his last state visit.
- In June, [GPE Washington] passed a primary seatbelt law.
In POS tagging, each words get one tag. In NER, we need to find and segment the entities

HMM for NER

While HMM has been used for NER, it is not the most popular candidate; Conditional Random Fields and other models are better suited for the task.

When is NER useful?

classifying user intentions (e.g. when speaking to Siri/Alexa/Google Assistant: “add a meeting with Lorenzo at 15:45 at the Starbucks”)
detecting mentions of a product/company online, before extracting opinions and doing sentiment analysis on them
and more …

How to decide if we’re good at NER

recall precision = \frac{correctly found entities}{all entities in text} = \frac{1}{3} = 33% = \frac{correctly found entities}{all found entities} = \frac{1}{2} = 50%

We summarize these in one number, the F1 score, by taking their harmonic mean:

F_{1} = 2 \cdot \frac{precision \cdot recall}{precision + recall} = 2 \cdot \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2} + \frac{1}{3}} = 0.4 = 40%

F1 score for state-of-the-art NER is ~94% on news (but only ~50% on social media)

🚀 Costin Chitic

Recent Notes

Transformers and Normalizing Flows

Lecture 2

.

Gimbal Lock

Quaternions

Hidden Markov Models for PoS tagging and NER

Parts of Speech

Markov Chains

Background

Hidden Markov Models

Viterbi Algorithm

How to decide if we’re good at PoS tagging?

Named Entity Recognition (NER)

HMM for NER

When is NER useful?

How to decide if we’re good at NER

Graph View

Table of Contents

Backlinks