Transformers

| Timestamp | Description |
| --- | --- |
| 00:00:00 | Transformers from scratch |
| 00:01:05 | Subword tokenization |
| 00:04:27 | Subword tokenization with byte-pair encoding (BPE) |
| 00:06:53 | The shortcomings of recurrent-based attention |
| 00:07:55 | How Self-Attention works |
| 00:14:49 | How Multi-Head Self-Attention works |
| 00:17:52 | The advantages of multi-head self-attention |
| 00:18:20 | Adding positional information |
| 00:20:30 | Adding a non-linear layer |
| 00:22:02 | Stacking encoder blocks |
| 00:22:30 | Dealing with side effects using layer normalization and skip connections |
| 00:26:46 | Input to the decoder block |
| 00:27:11 | Masked Multi-Head Self-Attention |
| 00:29:38 | The rest of the decoder block |
| 00:30:39 | [DEMO] Coding a Transformer from scratch |
| 00:56:29 | Transformer drawbacks |
| 00:57:14 | Pre-Training and Transfer Learning |
| 00:59:36 | The Transformer families |
| 01:01:05 | How BERT works |
| 01:09:38 | GPT: Language modelling at scale |
| 01:15:13 | [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI |
| 01:51:48 | The Transformer is a "general-purpose differentiable computer" |
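
For anyone revisiting the self-attention chapters (00:07:55 and 00:14:49), the sketch below shows the core scaled dot-product self-attention computation in NumPy. It is a minimal illustration of the general mechanism with random placeholder weights and assumed shapes, not the code written in the 00:30:39 demo.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only;
# not the lecture's from-scratch demo code).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Usage with random data: 5 tokens, d_model = 16, d_k = 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)      # shape (5, 8)
```

Multi-head self-attention (00:14:49) runs several such projections in parallel with a smaller d_k per head and concatenates the heads' outputs before a final linear layer.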

References and Links