This is Lecture 6 from my Natural Language Processing course.

Neural Networks

Logistic Regression

We wrote the posterior probability of the class given a datapoint as:
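
A standard form, assuming binary logistic regression with weight vector $\mathbf{w}$ and bias $b$ (the lecture's exact notation is not reproduced here):

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$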

The parameters of this model, the weight vector w and the bias b, can be trained by conditional Maximum Likelihood:
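
Under the same assumed notation, conditional Maximum Likelihood chooses the parameters that maximise the log-probability of the observed labels given the inputs:

$$\hat{\mathbf{w}}, \hat{b} = \arg\max_{\mathbf{w},\, b} \; \sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i ; \mathbf{w}, b)$$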

This is equivalent to minimizing the binary cross-entropy loss.
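
For reference, with $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$, the binary cross-entropy loss over $N$ examples is the negative of the conditional log-likelihood above (up to averaging):

$$L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big]$$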

Activation Functions

The most commonly used activation functions (which squash their input into a bounded range, e.g. a 0-1 probability) are:
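
A minimal NumPy sketch, assuming the standard choices of sigmoid, tanh, and ReLU:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1), so the output can act as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes input into (-1, 1); a zero-centred relative of the sigmoid.
    return np.tanh(z)

def relu(z):
    # Keeps positive values, zeroes out negatives; common in hidden layers.
    return np.maximum(0.0, z)
```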

Feed-forward networks and back-propagation

Computational Graphs

Consider the following function

We can decompose this into temporary variables and compute the forward and backward propagations

If we consider a composite function $y = f(g(x))$, then the derivative of $y$ with respect to $x$ can be decomposed via the chain rule into $\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$.
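
A worked sketch using a hypothetical function (the lecture's own example is not shown here): we decompose f(x, y, z) = (x + y) * z into temporary variables, run the forward pass, then apply the chain rule backwards through the graph.

```python
def forward_backward(x, y, z):
    # Forward pass: decompose the function into temporary variables.
    a = x + y          # a = x + y
    f = a * z          # f = a * z

    # Backward pass: apply the chain rule from the output back to the inputs.
    df_da = z          # d(a*z)/da
    df_dz = a          # d(a*z)/dz
    da_dx = 1.0        # d(x+y)/dx
    da_dy = 1.0        # d(x+y)/dy
    df_dx = df_da * da_dx
    df_dy = df_da * da_dy
    return f, (df_dx, df_dy, df_dz)

# Example: f(-2, 5, -4) = -12, with gradients (-4, -4, 3).
print(forward_backward(-2.0, 5.0, -4.0))
```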

Transformers

Self-attention layers (Basic version)

  • Weigh each input by its importance within the context
  • Most basic version (see the sketch after this list):
  • All outputs can be computed in parallel.
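
A sketch of the most basic version, assuming the usual textbook presentation with dot-product scores over the raw input vectors (no learned projections yet): each output is a weighted average of all inputs, with weights given by a softmax over the scores.

$$\text{score}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\big(\text{score}(\mathbf{x}_i, \mathbf{x}_j)\big), \qquad \mathbf{y}_i = \sum_j \alpha_{ij} \, \mathbf{x}_j$$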

Self-attention layers (Extended version)

Each input embedding plays multiple roles:

  • Current focus of attention (Query)
  • Preceding input compared to query (Key)
  • Input weighted in computing the output (Value)

So, we encode these as:
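
Assuming the standard transformer notation with learned projection matrices (the lecture's exact symbols are not shown here):

$$\mathbf{q}_i = \mathbf{W}^Q \mathbf{x}_i, \qquad \mathbf{k}_i = \mathbf{W}^K \mathbf{x}_i, \qquad \mathbf{v}_i = \mathbf{W}^V \mathbf{x}_i$$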

and get:
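
Under the same assumptions, with $d_k$ the dimensionality of the key vectors, the usual scaled dot-product form is:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right), \qquad \mathbf{y}_i = \sum_j \alpha_{ij} \, \mathbf{v}_j$$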

Transformers

  • Map sequence to sequence
  • Made of stacks of transformer blocks

Units of transformer blocks

  • Self attention layers
  • Residual connections
    • Directly pass information from lower to higher layer
    • Significantly improve training
  • Normalisation layers
    • Limits range of values: facilitates gradient-based learning
    • Make all vector elements have zero mean and unit variance (see the sketch after this list)
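
A minimal NumPy sketch of how these units fit together in one block, assuming hypothetical self_attention and feed_forward callables and a simplified layer normalisation without the learned scale and offset parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each vector to zero mean and unit variance,
    # limiting the range of values and easing gradient-based learning.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, self_attention, feed_forward):
    # Residual connections: add each sub-layer's input back to its output,
    # passing information directly from lower to higher layers.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x
```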

Multi-head attention and positional encoding

  • How to capture different kinds of relationships between inputs?
    • Various versions of Q, K, V (one set of projections per attention head)
  • Positional encoding
    • Embedding contains position information (see the sketch after this list)
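
A sketch of one common choice, the fixed sinusoidal encoding from the original transformer paper (BERT instead learns position embeddings); the function name is hypothetical:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model.
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings so that each
# embedding also carries information about its position:
# inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```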

Masked training and fine tuning

  • Unidirectional: Predict future from past
  • Bidirectional: Predict masked tokens from context on both sides (see the sketch below)
  • No encoder-decoder architecture: BERT uses only the encoder stack
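
A rough contrast of the two objectives in equations (notation assumed, not from the slides): a unidirectional language model conditions only on the past, while a masked language model predicts a masked position from the full surrounding context.

$$P_{\text{uni}}(w_t \mid w_1, \ldots, w_{t-1}) \qquad \text{vs.} \qquad P_{\text{masked}}(w_t \mid w_1, \ldots, w_{t-1}, \text{[MASK]}, w_{t+1}, \ldots, w_n)$$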

BERT: Bidirectional Encoder Representations from Transformers

BERT

  • Sub-word vocabulary (WordPiece): 30k tokens
  • Hidden layers: 768 nodes
  • 12 transformer blocks
  • 12 attention heads in each multi-head attention layer, each of dimensionality 768 / 12 = 64

Training BERT