This is a lecture from my Natural Language Processing Course at Twente.
Language Modelling
- is the task of predicting what words come next
- p(in | Please turn your homework) > p(the | Please turn your homework)
- It also computes the probability of a sequence of words
- p(Please turn your homework in) > p(in Please homework turn your)

Probabilities (very briefly)
Probabilities measure how likely an event is to occur or how likely a proposition is to be true. The probabilities of all possible alternatives add up to one.
- Conditional probability: p(a | b) — the probability of a given b
- Joint probability: p(a, b) — the probability of a and b
- if a, b are independent, then: p(a, b) = p(a) p(b)
- if a, b are independent, then: p(a | b) = p(a)
Chain rule of probability
p(x1, x2, …, xn) = p(x1) p(x2 | x1) p(x3 | x1, x2) … p(xn | x1, …, xn-1)
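The chain rule can be checked numerically. The following is a sketch of my own, not part of the lecture; the toy joint distribution and the variable names are invented purely for illustration.

```python
# Toy check of the chain rule: p(a, b, c) = p(a) * p(b | a) * p(c | a, b)
# The joint distribution below is invented purely for illustration.
joint = {
    ("sun", "warm", "walk"): 0.30,
    ("sun", "warm", "stay"): 0.10,
    ("sun", "cold", "walk"): 0.05,
    ("sun", "cold", "stay"): 0.05,
    ("rain", "warm", "walk"): 0.05,
    ("rain", "warm", "stay"): 0.05,
    ("rain", "cold", "walk"): 0.10,
    ("rain", "cold", "stay"): 0.30,
}

def marginal(**fixed):
    """Sum the joint probability over all outcomes matching the fixed values."""
    names = ("a", "b", "c")
    total = 0.0
    for outcome, p in joint.items():
        if all(outcome[names.index(k)] == v for k, v in fixed.items()):
            total += p
    return total

a, b, c = "sun", "warm", "walk"
p_a = marginal(a=a)
p_b_given_a = marginal(a=a, b=b) / p_a
p_c_given_ab = marginal(a=a, b=b, c=c) / marginal(a=a, b=b)

# Both sides should match: 0.30
print(joint[(a, b, c)], p_a * p_b_given_a * p_c_given_ab)
```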
N-grams
please turn your homework ___
An n-gram is a chunk of n consecutive words.
- unigrams: “please”, “turn”, “your”, “homework”
- bigrams: “please turn”, “turn your”, “your homework”
- trigrams: “please turn your”, “turn your homework”
- 4-grams: “please turn your homework”
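A minimal sketch of extracting n-grams from a tokenised sentence (my own illustration, not from the slides):

```python
def ngrams(tokens, n):
    """Return all chunks of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print(ngrams(tokens, 2))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
```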
Joint probability of words in a sentence
By the chain rule:
p(w1 w2 … wn) = p(w1) p(w2 | w1) p(w3 | w1 w2) … p(wn | w1 … wn-1)
or, more compactly,
p(w1:n) = ∏_{k=1..n} p(wk | w1:k-1)
In general, we can approximate the history by just the last N-1 words (the Markov assumption):
p(wn | w1:n-1) ≈ p(wn | wn-N+1:n-1)
so that the probability of a sequence becomes
p(w1:n) ≈ ∏_{k=1..n} p(wk | wk-N+1:k-1)
Maximum Likelihood Estimation
For bigrams, the MLE estimate is the bigram count normalised by the count of the preceding word:
p(wn | wn-1) = C(wn-1 wn) / C(wn-1)
Example
Given the following corpus of 3 sentences, estimate the bigram probabilities using Maximum Likelihood Estimation:
⟨s⟩ I am Sam ⟨/s⟩
⟨s⟩ Sam I am ⟨/s⟩
⟨s⟩ I do not like green eggs and ham ⟨/s⟩
Some of the bigram estimates from this corpus:
p(I | ⟨s⟩) = 2/3 ≈ 0.67      p(Sam | ⟨s⟩) = 1/3 ≈ 0.33
p(am | I) = 2/3 ≈ 0.67       p(do | I) = 1/3 ≈ 0.33
p(Sam | am) = 1/2 = 0.5      p(⟨/s⟩ | Sam) = 1/2 = 0.5
And the probability of the sentence I am Sam:
p(⟨s⟩ I am Sam ⟨/s⟩) = p(I | ⟨s⟩) · p(am | I) · p(Sam | am) · p(⟨/s⟩ | Sam) ≈ 0.67 · 0.67 · 0.5 · 0.5 ≈ 0.11
Breaking down why the examples give these values
In the first row we compute the probability of starting a sentence with the word “I”: 2 out of 3 sentences start with it, therefore 0.67. The second estimate in that row works the same way: 1 out of 3 sentences starts with “Sam”, therefore 0.33.
In the second row we ask how likely the word “am” is to follow immediately after “I”: “I” occurs 3 times and is followed by “am” in 2 of them, therefore 0.67 again; likewise “do” follows “I” in 1 of the 3 cases, therefore 0.33.
For the last row only the first two sentences matter, because “am” and “Sam” appear only there. “am” is followed by “Sam” in 1 of its 2 occurrences, and in the last estimate “Sam” ends a sentence in 1 of its 2 occurrences, therefore 0.5.
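The same estimates can be reproduced with a few lines of Python. This is a sketch of my own, not from the slides; the helper name p_mle is invented.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count unigrams and bigrams over the whole corpus.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram estimate: C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# p(<s> I am Sam </s>) ≈ 0.67 * 0.67 * 0.5 * 0.5 ≈ 0.11
sentence = "<s> I am Sam </s>".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_mle(word, prev)
print(round(prob, 3))
```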
Perplexity
What do those numbers in the example mean? How do we quantify how good a model is? Where is the noise coming from?
The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
For a test set W = w1 w2 … wN:
PP(W) = p(w1 w2 … wN)^(-1/N) = (1 / p(w1 w2 … wN))^(1/N)
- It decreases as the model improves. REMEMBER to include the sentence end-markers ⟨/s⟩ in the total count of word tokens N (but not the start-markers ⟨s⟩).
- It is only comparable for models with the same vocabulary.
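A minimal sketch of computing perplexity for the bigram model above. This is my own code; it reuses the hypothetical p_mle helper from the earlier sketch and assumes no zero-probability bigrams, which the smoothing section below addresses.

```python
import math

def perplexity(sentences, p_mle):
    """Perplexity = inverse probability of the test set, normalised by the token count N.
    N counts the sentence end-markers </s> but not the start-markers <s>."""
    log_prob = 0.0
    n_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(p_mle(word, prev))  # assumes p > 0 (see smoothing below)
        n_tokens += len(tokens) - 1  # exclude <s>, include </s>
    return math.exp(-log_prob / n_tokens)

print(perplexity(["<s> I am Sam </s>"], p_mle))
```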
Out of Vocabulary Words
What about unknown words in the test set?
Smoothing
If an n-gram needed at test time was never seen in training, its MLE probability is zero, so the probability of the whole test set is zero and the perplexity is infinite. Zero probabilities make evaluation impossible.
Smoothing suggests:
- Slightly increase the probability of unseen instances
- Requires reducing the probability of seen instances
- Multiple approaches
Laplace Smoothing
- Rough adjustment to MLE
- “add-one” smoothing
- Write ci = C(wi) for brevity: how often word wi occurs in the training data.
- N = number of word tokens in the dataset, V = size of the vocabulary (number of distinct word types).
- MLE estimate: p(wi) = ci / N
- Laplace estimate: pLaplace(wi) = (ci + 1) / (N + V)
- Define “adjusted counts” to keep the denominator N (dividing by N then gives a probability):
  ci* = (ci + 1) · N / (N + V)
- Define the “relative discount”, the ratio of adjusted to raw counts:
  dc = ci* / ci
- “add-k smoothing” adds a fractional count k instead of 1:
  padd-k(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)
  - k = 0: no smoothing (plain MLE)
  - k = 0.1: a small fractional count, usually less drastic than add-one
  - k = 1: add-one (Laplace) smoothing
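A sketch of add-k smoothed bigram estimates on the toy corpus, reusing the hypothetical unigram_counts and bigram_counts dictionaries from the earlier sketch (my own code, not part of the original slides):

```python
def p_add_k(word, prev, k=1.0):
    """Add-k smoothed bigram estimate: (C(prev word) + k) / (C(prev) + k * V).
    k = 1 gives Laplace (add-one) smoothing; k = 0 falls back to plain MLE."""
    vocab_size = len(unigram_counts)
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

# An unseen bigram now gets a small but non-zero probability.
print(p_add_k("ham", "I", k=1.0))   # > 0 even though "I ham" never occurs
print(p_add_k("am", "I", k=0.1))    # higher than with k = 1, closer to the MLE estimate 0.67
```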

Backoff and Interpolation
As n increases, the number of possible n-grams increases exponentially. So the question is: can we use lower-order n-grams to estimate higher-order n-grams?
Interpolation
- Mix the probability estimates from all the n-gram estimators (high to low until unigrams)
- Weighted average of different order n-gram probabilities, e.g. for trigrams:
  p̂(wn | wn-2 wn-1) = λ1 p(wn) + λ2 p(wn | wn-1) + λ3 p(wn | wn-2 wn-1)
  with λ1 + λ2 + λ3 = 1.
- n-gram-specific λ learnt on a held-out corpus.
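A sketch of simple linear interpolation over unigram, bigram and trigram MLE estimates (my own code; it reuses the toy corpus and counters from the earlier sketches, and the λ values are placeholders that would normally be tuned on a held-out corpus):

```python
from collections import Counter

# Trigram counts, built the same way as the unigram/bigram counts in the earlier sketch.
trigram_counts = Counter()
for sentence in corpus:  # corpus: the toy 3-sentence corpus from the MLE sketch
    tokens = sentence.split()
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

def p_interpolated(word, prev2, prev1, lambdas=(0.1, 0.3, 0.6)):
    """Weighted average of unigram, bigram and trigram MLE estimates.
    lambdas = (λ1, λ2, λ3) must sum to 1; these values are placeholders, not tuned."""
    l1, l2, l3 = lambdas
    p_uni = unigram_counts[word] / sum(unigram_counts.values())
    p_bi = (bigram_counts[(prev1, word)] / unigram_counts[prev1]
            if unigram_counts[prev1] else 0.0)
    p_tri = (trigram_counts[(prev2, prev1, word)] / bigram_counts[(prev2, prev1)]
             if bigram_counts[(prev2, prev1)] else 0.0)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_interpolated("Sam", "I", "am"))
```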

Stupid Backoff
- Recursively “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram
Problem: for huge corpora (the web), it is hard to:
- Store probabilities
- Compute correct back-off weights
Solution (Stupid Backoff): don’t compute proper probabilities; use relative frequencies and a single fixed back-off weight
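A sketch of the idea (my own code; the back-off weight 0.4 is the value commonly cited for Stupid Backoff, and the counter is built from the toy corpus used in the earlier sketches):

```python
from collections import Counter

# One counter over n-grams of every order (lengths 1..3), built from the toy corpus.
ngram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for n in (1, 2, 3):
        ngram_counts.update(zip(*(tokens[i:] for i in range(n))))

def stupid_backoff(word, context, alpha=0.4):
    """Score (not a true probability): relative frequency if the n-gram was seen,
    otherwise alpha times the score with a shortened context."""
    if context:
        history = tuple(context)
        if ngram_counts[history + (word,)] > 0:
            return ngram_counts[history + (word,)] / ngram_counts[history]
        return alpha * stupid_backoff(word, context[1:], alpha)
    # Base case: unigram relative frequency.
    total = sum(c for ng, c in ngram_counts.items() if len(ng) == 1)
    return ngram_counts[(word,)] / total

print(stupid_backoff("Sam", ("I", "am")))   # seen trigram: 1/2 = 0.5
print(stupid_backoff("ham", ("I", "am")))   # backs off twice, down to the unigram estimate
```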
Text Classification
- Binary
- Spam filter: spam or not spam
- Multi-class
- Language identification: English, Thai, Italian, Chinese, Nepali, etc.
- Sentiment analysis: positive, negative or neutral
- Multi-label
- Subject indexing
- Extreme Multi-label Text Classification (XMTC): when there are thousands or tens of thousands of candidate classes
Classification Methods
Basically I would always choose supervised machine learning (Naive Bayes, Logistic Regression, Random Forest, etc.)

Normally a classifier expects numerical input, so what are the features? How do we identify features from text?
Types of textual features:
- Words: normalisation
- Characteristics of Words: capitalisation (US vs us)
- Part-Of-Speech: nouns, verbs, etc
- Grammatical structure, sentence parsing
- Grouping similar words: {happy, merry}, numbers, dates
- character n-grams: “ing”, “re”
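As a sketch of the supervised route, assuming scikit-learn is available; the tiny spam/ham dataset is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data: spam vs. not-spam.
texts = ["win a free prize now", "cheap meds online",
         "meeting at ten tomorrow", "please review my homework"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features (word counts) feeding a Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

print(classifier.predict(["free prize meeting"]))
```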
Binary Classification
Precision = TP / (TP + FP), Recall = TP / (TP + FN)
F1 measure: F1 = 2 · Precision · Recall / (Precision + Recall)
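A minimal sketch of the binary metrics computed from raw counts (the numbers are invented):

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(binary_metrics(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```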
Multi-class Classification

Confusion matrix for a three-class categorization task, showing for each pair of classes (c1,c2), how many documents from c1 were (in)correctly assigned to c2.

Two ways to average precision over classes:
- macro average precision: compute precision separately for each class, then take the mean of the per-class values (every class counts equally)
- micro average precision: pool the decisions for all classes into a single contingency table, then compute precision from the pooled counts (frequent classes dominate)
Always check whether your classes are balanced!
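A sketch of macro vs. micro averaged precision from per-class counts (the class names and counts are invented for illustration):

```python
# Per-class true-positive and false-positive counts (invented numbers).
per_class = {
    "urgent": {"tp": 8, "fp": 11},
    "normal": {"tp": 60, "fp": 55},
    "spam":   {"tp": 200, "fp": 33},
}

# Macro average: mean of the per-class precisions (each class weighs the same).
macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in per_class.values()) / len(per_class)

# Micro average: pool all counts first, then compute a single precision.
total_tp = sum(c["tp"] for c in per_class.values())
total_fp = sum(c["fp"] for c in per_class.values())
micro = total_tp / (total_tp + total_fp)

print(round(macro, 2), round(micro, 2))  # the micro average is dominated by the large "spam" class
```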
Practical issues
Let’s say I want to build a text classifier for real: what should I do? It depends on the amount of training data:
- no training data → manually written rules
- very little data → Naive Bayes
- a reasonable amount of data → any clever classifier
- a huge amount of data → high accuracy, but at a high cost; a simple method or a deep NN