Leo Dirac talks about how Transformer models like BERT and GPT-2 have taken the natural language processing (NLP) community by storm and have effectively replaced LSTM models in most practical applications. The talk covers:
- Traditional NLP. Background on natural language processing, why sequence modeling is hard for standard supervised machine learning approaches, and how bag-of-words can solve a document classification problem (a small sketch follows this list).
- Neural document processing: vanilla RNN, LSTM. How neural networks process sequences with simple recurrent neural networks, and how LSTM became the standard improvement upon them by mitigating vanishing and exploding gradients with what is effectively a residual-network approach (see the LSTM sketch below).
- Transformers. How transformer networks work: what attention mechanisms look like visually and in pseudo-code, how positional encoding takes them beyond a bag-of-words, and how transformers benefit from modern ReLU activations (an attention sketch appears below).
- Code. The most important advantage of transformers over LSTMs is that transfer learning works, allowing you to fine-tune a large pre-trained model for your task. Shows how to do this in 12 lines of Python (a hedged sketch of such code appears below).
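
For reference, here is a minimal bag-of-words document classifier of the kind the first section describes. This is a sketch using scikit-learn; the library choice, toy corpus, and labels are illustrative assumptions, not the talk's own example.

```python
# A minimal bag-of-words classifier sketched with scikit-learn
# (toy data; not the talk's example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: each document gets a label; word order is ignored entirely.
docs = ["great movie loved it", "terrible plot waste of time",
        "wonderful acting", "boring and slow"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# CountVectorizer turns each document into a vector of word counts
# (the "bag of words"); LogisticRegression classifies those vectors.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs, labels)

print(model.predict(["loved the acting"]))  # likely predicts the positive class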
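```

The recurrent-network section can be summarized in a small PyTorch sketch of an LSTM document classifier. The layer sizes and vocabulary size below are arbitrary assumptions, not the talk's architecture; the point is how the network consumes a token sequence and classifies from the final hidden state.

```python
# A minimal LSTM document classifier in PyTorch (sizes are illustrative).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):      # token_ids: (batch, seq_len)
        x = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)     # h_n: final hidden state, (1, batch, hidden_dim)
        return self.fc(h_n[-1])        # logits: (batch, num_classes)

logits = LSTMClassifier()(torch.randint(0, 10000, (4, 20)))  # 4 docs of 20 tokens
```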
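
The attention pseudo-code and positional encoding the transformer section walks through look roughly like the following. This is a sketch of the standard scaled dot-product formulation from "Attention Is All You Need", not a transcription of the talk's slides.

```python
# Scaled dot-product self-attention and sinusoidal positional encoding (sketch).
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Each query attends to every key; softmax turns scores into weights over values.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return torch.softmax(scores, dim=-1) @ V

def positional_encoding(seq_len, d_model):
    # Sinusoids of different frequencies give each position a unique signature,
    # which is what distinguishes the model from a bag-of-words.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(10, 64) + positional_encoding(10, 64)  # 10 tokens, 64-dim embeddings
out = scaled_dot_product_attention(x, x, x)             # self-attention over the sequence
```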
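
The fine-tuning demo relies on a pre-trained transformer, presumably loaded through a library such as Hugging Face transformers. The sketch below assumes that library, an assumed model name, toy data, and illustrative hyperparameters; it approximates the idea rather than reproducing the talk's exact 12 lines.

```python
# Hedged sketch of transfer learning with a pre-trained transformer
# (Hugging Face transformers assumed; data and hyperparameters are made up).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie loved it", "terrible plot waste of time"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                              # a few fine-tuning steps
    loss = model(**batch, labels=labels).loss   # classification head trained on top of BERT
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because the pre-trained weights already encode a great deal of language knowledge, only a small learning rate and a few passes over the task data are typically needed, which is the transfer-learning advantage the talk emphasizes.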