Fardin Abdi from Uber will present “An introduction to Horovod,” describing techniques to scale deep learning training jobs across many GPUs. This will be an excellent mix of software engineering techniques for distributed computing and the math of deep learning.
Leo Dirac will present “Geometric Intuition of Configuration Space,” about Bayesian Optimization as the basis for AutoML, its fundamental limitations, and ways to work around them.
For techies diving into machine learning and deep learning, the math can be daunting. While you can get a lot done without being fluent in vector spaces and linear transformations, a conceptual understanding of them can go a long way toward making you effective, both by clarifying what’s possible and by helping you work with scientists.
We highly recommend the video series by the amazing educator 3blue1brown (a.k.a. Grant Sanderson). If you took a linear algebra class a while ago and don’t remember all of it, or even if you never took one, this is a super-efficient way to strengthen your grasp of the basics. Or maybe that linear algebra class left you confused: fair chance these videos will explain things more clearly than your professor did. (No offense, professor.)
We’ve collected the entire series into a playlist for you here…
Or if you have more time on your hands and want to go deeper, try a full MOOC, like one of these:
Leo Dirac talks about how Transformer models like BERT and GPT-2 have taken the natural language processing (NLP) community by storm, effectively replacing LSTM models for most practical applications. The talk covers:
Traditional NLP. Background on natural language processing, why sequence modeling is difficult for standard supervised machine learning approaches, and how bag-of-words can solve a document classification problem.
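To make the bag-of-words idea concrete, here is a minimal, self-contained sketch: the toy corpus, class names, and scoring rule are our own illustration, not code from the talk. It represents each document as word counts, discarding word order entirely, and classifies by overlap with per-class word profiles.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())

def score(doc_counts, class_counts):
    """Overlap score: how much of the document's vocabulary appears in the class profile."""
    return sum(min(n, class_counts.get(w, 0)) for w, n in doc_counts.items())

# Toy "training data": a class profile is just the summed word counts of its documents.
sports = bag_of_words("the team won the game") + bag_of_words("a great goal in the match")
cooking = bag_of_words("chop the onions and simmer the sauce") + bag_of_words("bake the bread")

def classify(text):
    doc = bag_of_words(text)
    return "sports" if score(doc, sports) >= score(doc, cooking) else "cooking"
```

Real systems would use TF-IDF weighting and a trained classifier, but the key property (and limitation) is the same: “the dog bit the man” and “the man bit the dog” produce identical features.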
Neural document processing: Vanilla RNN, LSTM. How neural networks process sequences with simple recurrent neural networks, and LSTM as the standard improvement upon them, taming vanishing and exploding gradients with what is effectively a residual-network approach.
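A scalar sketch of the vanilla RNN recurrence makes the vanishing-gradient problem easy to see; the weights and sequence here are illustrative numbers, not the talk's examples:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.4, b=0.0):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(inputs):
    """Fold a whole sequence through the recurrence, starting from h = 0."""
    h = 0.0
    for x in inputs:
        h = rnn_step(x, h)
    return h

# The gradient of h_T with respect to h_0 is a product of T factors of
# w_h * tanh'(.). With |w_h| < 1 (and tanh' <= 1) this product shrinks
# exponentially with sequence length: the vanishing-gradient problem.
grad = 1.0
for _ in range(50):
    grad *= 0.4  # each factor is at most w_h
```

LSTMs route information through an additive cell state, so gradients flow through sums rather than a long chain of multiplications, much like the skip connections in residual networks.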
Transformers. How transformer networks work: what attention mechanisms look like visually and in pseudo-code, and how positional encoding takes them beyond a bag-of-words. How transformers benefit from modern ReLU activations.
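For readers who want the attention mechanism in runnable form rather than pseudo-code, here is a plain-Python sketch of scaled dot-product attention (our own minimal implementation, using lists instead of tensors):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query produces a softmax-weighted
    average of the values, weighted by query-key similarity."""
    d = len(keys[0])  # key dimension, used for the 1/sqrt(d) scaling
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out
```

Note that nothing here depends on position: shuffling the key/value pairs shuffles nothing in the output, which is exactly why transformers need positional encodings added to the inputs.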
Code. The most important advantage of transformers over LSTMs is that transfer learning works, allowing you to fine-tune a large pre-trained model for your task. The talk shows how to do this in 12 lines of Python.
Leo Dirac explains what it looks like to train a neural network through geometric intuition. The structure of the talk is:
Supervised learning. What a decision boundary looks like for a simple binary classification problem, and how the data interact with it during training. What a loss surface is, and how SGD finds its way to the bottom of it.
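The “finding the bottom of the loss surface” picture can be sketched in a few lines. This toy uses full gradient descent on a one-dimensional convex surface (true SGD would estimate the gradient from minibatches); the surface and learning rate are our own illustration:

```python
def loss(w):
    """A toy convex loss surface: a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Gradient of the loss at w."""
    return 2.0 * (w - 3.0)

w = 0.0    # start somewhere on the surface
lr = 0.1   # learning rate: the step size downhill
for _ in range(100):
    w -= lr * grad(w)  # step along the negative gradient
# w has now descended the surface and sits very close to the minimum at 3.0
```

On a convex surface like this, gradient descent reliably reaches the single minimum; the interesting questions in the talk start when the surface is non-convex and has many valleys.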
Training Deep Neural Networks with Non-Convex Optimization. How neural networks make the decision boundary more complex, and what a non-convex loss surface looks like. Then some recent research into the shapes of these loss surfaces: first, how sharp-minima theory implies we should seek a wide valley in the loss surface; then research implying that all local minima are equivalent and connected, and a couple of algorithms, including Entropy-SGD and SWA, that take advantage of this structure.
Practical applications with code. Code samples showing how to apply SWA using PyTorch or TensorFlow.
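The core of SWA (Stochastic Weight Averaging) is simple enough to sketch without a framework: keep running SGD, and after a warm-up period average the weights the trajectory visits. This pure-Python toy is our own illustration on a noisy one-dimensional loss, not the PyTorch/TensorFlow code from the talk:

```python
import random

random.seed(0)

def noisy_grad(w):
    """Gradient of the toy loss (w - 3)^2, plus noise mimicking minibatch SGD."""
    return 2.0 * (w - 3.0) + random.gauss(0.0, 1.0)

w, lr = 0.0, 0.05
swa_sum, swa_count = 0.0, 0

for step in range(500):
    w -= lr * noisy_grad(w)
    if step >= 250:          # after warm-up, start averaging the iterates
        swa_sum += w
        swa_count += 1

w_swa = swa_sum / swa_count
# A single SGD iterate keeps bouncing around the minimum because of gradient
# noise; the averaged weight w_swa sits near the center of that cloud, i.e.
# toward the middle of the "wide valley" the sharp-minima research recommends.
```

In real training the same idea appears as averaging checkpoints taken with a cyclic or constant learning rate late in training, rather than averaging every iterate.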