This is Lecture 6 from my Natural Language Processing course.

Neural Networks

Logistic Regression

We wrote the posterior probability of the class given a datapoint as:
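
A standard form, assuming binary logistic regression with weight vector $\mathbf{w}$ and bias $b$ (the lecture's exact notation is not reproduced here):

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$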

The parameters of this model, the weight vector w and the bias b, can be trained by conditional Maximum Likelihood:
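
Under the same assumed notation, conditional Maximum Likelihood chooses the parameters that maximise the log-probability of the observed labels given the inputs:

$$\hat{\mathbf{w}}, \hat{b} = \arg\max_{\mathbf{w},\, b} \; \sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i ; \mathbf{w}, b)$$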

This is equivalent to minimizing the binary cross-entropy loss.
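
For reference, with $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$, the binary cross-entropy loss over $N$ examples is the negative of the conditional log-likelihood above (up to averaging):

$$L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big]$$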

Activation Functions

The most commonly used activation functions (which squash their input into a bounded range, e.g. a 0-1 probability) are:
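
A minimal NumPy sketch, assuming the standard choices of sigmoid, tanh, and ReLU:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1), so the output can act as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes input into (-1, 1); a zero-centred relative of the sigmoid.
    return np.tanh(z)

def relu(z):
    # Keeps positive values, zeroes out negatives; common in hidden layers.
    return np.maximum(0.0, z)
```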

Feed-forward networks and back-propagation

Computational Graphs

Consider the following function

We can decompose this into temporary variables and compute the forward and backward propagations

If we consider a composite function $y = f(g(x))$, then the derivative of $y$ with respect to $x$ can be decomposed via the chain rule into $\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$.
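
A worked sketch using a hypothetical function (the lecture's own example is not shown here): we decompose f(x, y, z) = (x + y) * z into temporary variables, run the forward pass, then apply the chain rule backwards through the graph.

```python
def forward_backward(x, y, z):
    # Forward pass: decompose the function into temporary variables.
    a = x + y          # a = x + y
    f = a * z          # f = a * z

    # Backward pass: apply the chain rule from the output back to the inputs.
    df_da = z          # d(a*z)/da
    df_dz = a          # d(a*z)/dz
    da_dx = 1.0        # d(x+y)/dx
    da_dy = 1.0        # d(x+y)/dy
    df_dx = df_da * da_dx
    df_dy = df_da * da_dy
    return f, (df_dx, df_dy, df_dz)

# Example: f(-2, 5, -4) = -12, with gradients (-4, -4, 3).
print(forward_backward(-2.0, 5.0, -4.0))
```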

Transformers

Self-attention layers (Basic version)

  • Weigh each input by its importance within the context
  • Most basic version (see the sketch after this list):
  • All outputs can be computed in parallel.
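
A sketch of the most basic version, assuming the usual textbook presentation with dot-product scores over the raw input vectors (no learned projections yet): each output is a weighted average of all inputs, with weights given by a softmax over the scores.

$$\text{score}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\big(\text{score}(\mathbf{x}_i, \mathbf{x}_j)\big), \qquad \mathbf{y}_i = \sum_j \alpha_{ij} \, \mathbf{x}_j$$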

Self-attention layers (Extended version)

Each input embedding plays multiple roles:

  • Current focus of attention (Query)
  • Preceding input compared to query (Key)
  • Input weighted in computing the output (Value)

So, we encode these as:
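
Assuming the standard transformer notation with learned projection matrices (the lecture's exact symbols are not shown here):

$$\mathbf{q}_i = \mathbf{W}^Q \mathbf{x}_i, \qquad \mathbf{k}_i = \mathbf{W}^K \mathbf{x}_i, \qquad \mathbf{v}_i = \mathbf{W}^V \mathbf{x}_i$$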

and get:
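
Under the same assumptions, with $d_k$ the dimensionality of the key vectors, the usual scaled dot-product form is:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right), \qquad \mathbf{y}_i = \sum_j \alpha_{ij} \, \mathbf{v}_j$$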

Transformers

  • Map sequence to sequence
  • Made of stacks of transformer blocks

Units of transformer blocks

  • Self attention layers
  • Residual connections
    • Directly pass information from lower to higher layer
    • Significantly improve training
  • Normalisation layers
    • Limits range of values: facilitates gradient-based learning
    • Make all vector elements have zero mean and unit variance (see the sketch after this list)
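
A minimal NumPy sketch of how these units fit together in one block, assuming hypothetical self_attention and feed_forward callables and a simplified layer normalisation without the learned scale and offset parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each vector to zero mean and unit variance,
    # limiting the range of values and easing gradient-based learning.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, self_attention, feed_forward):
    # Residual connections: add each sub-layer's input back to its output,
    # passing information directly from lower to higher layers.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x
```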

Multi-head attention and positional encoding

  • How to capture different kinds of relationships between inputs?
    • Various versions of Q, K, V (one set of projections per attention head)
  • Positional encoding
    • Embedding contains position information (see the sketch after this list)
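
A sketch of one common choice, the fixed sinusoidal encoding from the original transformer paper (BERT instead learns position embeddings); the function name is hypothetical:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model.
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings so that each
# embedding also carries information about its position:
# inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```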

Masked training and fine tuning

  • Unidirectional: Predict future from past
  • Bidirectional: Predict masked tokens from context on both sides (see the sketch below)
  • No encoder-decoder architecture: BERT uses only the encoder stack
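
A rough contrast of the two objectives in equations (notation assumed, not from the slides): a unidirectional language model conditions only on the past, while a masked language model predicts a masked position from the full surrounding context.

$$P_{\text{uni}}(w_t \mid w_1, \ldots, w_{t-1}) \qquad \text{vs.} \qquad P_{\text{masked}}(w_t \mid w_1, \ldots, w_{t-1}, \text{[MASK]}, w_{t+1}, \ldots, w_n)$$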

BERT: Bidirectional Encoder Representations from Transformers

BERT

  • Sub-word vocabulary (WordPiece): 30k tokens
  • Hidden layers: 768 nodes
  • 12 transformer blocks
  • 12 attention heads in each multi-head attention layer, each of dimensionality 768 / 12 = 64

Training BERT