PyTorch Course: Deconstructing Modern Architectures
Course Goal: To provide a deep and essential understanding of PyTorch building blocks and their practical application in designing, understanding, and implementing modern Transformer and Diffusion-based neural network architectures.
Learner level: Beginner - Advanced
Module 0: Getting Started with PyTorch
This module ensures learners have a working PyTorch environment and a first taste of its capabilities.
What you will learn:
- How to set up a PyTorch environment for different operating systems and verify that it works.
Lessons:
- PyTorch Course Structure - this page: the course aim, structure, and a guide to working through the modules.
- Setting Up Your PyTorch Environments:
  1. Setting Up a Windows dev environment - create a Windows dev environment with pyenv and Poetry; install the main dependencies with and without GPU support.
  2. Setting Up a Linux dev environment - create an Ubuntu dev environment with pyenv and Poetry; install the main dependencies with and without GPU support.
  3. Setting Up a macOS dev environment - create a macOS dev environment with pyenv and Poetry; install the main dependencies with and without GPU support.
  4. Setting Up Google Colab - create a dev environment in Google Colab.
Module 1: PyTorch Core - I see tensors everywhere
This module dives into the fundamental components of PyTorch, essential for any deep learning task.
1.1 Tensors: The Building Blocks
What you will learn:
- Tensor Concept. What is a tensor? Tensor vs. Matrix. Mathematical vs. PyTorch interpretation. Why tensors are crucial for ML
- PyTorch Basics: Tensor creation and their attributes (dtype, shape, device).
- Tensor manipulation: Indexing, Slicing, Joining (torch.cat, torch.stack), Splitting. Manipulating tensor shapes (reshape, view, squeeze, unsqueeze, permute, transpose).
1.1 Lessons:
- Introduction to Tensors - Introduction to tensors, their properties, and their importance in machine learning. Creating tensors (from lists, NumPy, torch.rand, torch.zeros, torch.ones, torch.arange, torch.linspace). How to check their attributes and shapes.
- Tensor manipulation - Indexing, Slicing, Joining (torch.cat, torch.stack), Splitting. Manipulating tensor shapes (reshape, view, squeeze, unsqueeze, permute, transpose).
- Data Types and Devices - Importance of data types (float32, float16, bfloat16, int64 etc.). CPU vs. GPU computations. Checking and changing dtype. Moving tensors between devices (.to(device), .cpu(), .cuda()). Best practices for mixed-precision training (conceptual introduction). Implications of data types on memory and speed.
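A minimal sketch of the tensor basics covered above, assuming a standard PyTorch install (values and sizes are arbitrary):

```python
import torch

# Create tensors from a Python list and from factory functions.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
noise = torch.rand(2, 3)            # uniform random values in [0, 1)
steps = torch.arange(0, 10, 2)      # 0, 2, 4, 6, 8

# Inspect the key attributes every tensor carries.
print(x.shape, x.dtype, x.device)   # torch.Size([2, 2]) torch.float32 cpu

# Reshape and move between devices (GPU only if one is available).
flat = x.reshape(-1)                # shape (4,)
batched = x.unsqueeze(0)            # add a leading batch dimension: (1, 2, 2)
device = "cuda" if torch.cuda.is_available() else "cpu"
x_half = x.to(device=device, dtype=torch.float16)   # change device and dtype in one call
```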
1.2 Tensor Operations: Computation at Scale
What you will learn:
- Overview of tensor math. Element-wise operations. Reduction operations (sum, mean, max, min, std). Basic matrix multiplication (torch.mm, torch.matmul, @ operator). Broadcasting: rules and practical examples with verifiable tiny data. In-place operations.
1.2 Lessons:
- Tensor Math Operations - Overview of tensor math. Element-wise operations. Reduction operations across dimensions (sum, mean, max, min, std).
- Matrix Multiplication - 2D matrix multiplication (torch.mm, torch.matmul, @ operator). Batch matrix multiplication (torch.bmm).
- Broadcasting - Broadcasting rules with practical examples across different dimensions. Broadcast math operations and vector or matrix multiplications.
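A small, verifiable sketch of element-wise operations, reductions, matrix multiplication, and broadcasting (toy values only):

```python
import torch

a = torch.arange(6.0).reshape(2, 3)        # shape (2, 3)
b = torch.tensor([10.0, 20.0, 30.0])       # shape (3,)

# Element-wise ops and reductions.
print((a * 2).sum(), a.mean(dim=0))        # scalar, per-column means of shape (3,)

# Broadcasting: b is stretched along the first dimension to match a.
print(a + b)                               # shape (2, 3)

# Matrix multiplication: (2, 3) @ (3, 2) -> (2, 2)
w = torch.rand(3, 2)
print(a @ w)

# Batched matmul: (4, 2, 3) @ (4, 3, 5) -> (4, 2, 5)
batch = torch.rand(4, 2, 3) @ torch.rand(4, 3, 5)
print(batch.shape)
```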
1.3 Einstein Summation: The Power of einsum
What you will learn:
- Understanding Einstein notation. Why it's powerful for complex operations (e.g., attention).
1.3 Lessons:
- Einstein Summation - Simple einsum examples (vector dot product, matrix multiplication, transpose).
- Advanced Einstein Summation - einsum for more complex operations like batch matrix multiplication, tensor contractions relevant to attention mechanisms. Examples with dimensions mirroring those in Transformers.
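A quick sketch of einsum usage, from a dot product up to batched attention-style scores (shapes are illustrative):

```python
import torch

q = torch.rand(2, 4, 8)   # (batch, seq_len, d_k) - toy "query"
k = torch.rand(2, 4, 8)   # (batch, seq_len, d_k) - toy "key"

# Dot product of two vectors.
dot = torch.einsum("i,i->", torch.rand(5), torch.rand(5))

# Plain matrix multiplication.
mm = torch.einsum("ij,jk->ik", torch.rand(3, 4), torch.rand(4, 2))

# Batched attention scores: contract over the feature dimension d.
scores = torch.einsum("bqd,bkd->bqk", q, k)   # (2, 4, 4), same as q @ k.transpose(-2, -1)
print(mm.shape, scores.shape)
```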
1.4 Autograd: Automatic Differentiation
What you will learn:
- What are gradients? The computational graph. How PyTorch tracks operations.
- requires_grad attribute. Performing backward pass with .backward(). Accessing gradients with .grad. torch.no_grad() and tensor.detach().
- Gradient accumulation. Potential pitfalls. Visualizing computational graphs (conceptually).
1.4 Lessons:
- Autograd - What are gradients? The computational graph. How PyTorch tracks operations.
- Gradient Accumulation - Gradient accumulation. Potential pitfalls. Visualizing computational graphs (conceptually).
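A minimal autograd sketch showing requires_grad, .backward(), .grad, and torch.no_grad():

```python
import torch

# requires_grad=True tells autograd to record operations on this tensor.
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

y = w * x + 1.0        # builds a small computational graph
y.backward()           # compute dy/dw
print(w.grad)          # tensor(3.) because dy/dw = x

# Gradients accumulate across backward calls - reset them explicitly.
w.grad.zero_()

# Disable tracking when you only need inference.
with torch.no_grad():
    y_eval = w * x + 1.0
print(y_eval.requires_grad)   # False
```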
Module 2: torch.nn — Building Neural Networks
This module explores the layer-building API that powers every PyTorch model.
2.1 The nn.Module Blueprint
What you will learn:
- The role of nn.Module as the base class for layers and models.
- Implementing __init__ and forward.
- Registering parameters and buffers.
- Composing modules with nn.Sequential, nn.ModuleList, and nn.ModuleDict.
- Saving and restoring weights with state_dict.
2.1 Lessons:
- nn.Module - The role of `nn.Module` as the base class for layers and models. The `__init__` and `forward` methods.
- Compose Modules - Composing modules with `nn.Sequential`, `nn.ModuleList`, and `nn.ModuleDict`.
- Saving Weights - Saving and restoring weights with `state_dict`.
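A minimal nn.Module sketch covering __init__, forward, nn.Sequential, and a state_dict round trip (the TinyMLP name and sizes are made up for illustration):

```python
import torch
from torch import nn

class TinyMLP(nn.Module):
    """A minimal nn.Module: parameters are registered in __init__, computation lives in forward."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyMLP(8, 16, 2)
out = model(torch.rand(4, 8))          # (4, 2)

# Save and restore weights via the state_dict.
torch.save(model.state_dict(), "tiny_mlp.pt")
model.load_state_dict(torch.load("tiny_mlp.pt"))
```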
2.2 Linear Layer and Activations
What you will learn:
- Linear layers and high-dimensional matrix multiplication.
- What is the role of linear layers in attention mechanisms (query, key, value)?
- Activation functions (ReLU, GELU, SiLU, Tanh, Softmax, etc.).
- Dropout for regularisation.
2.2 Lessons:
- Linear Layer - Linear layer and high-dimensional matrix multiplication. How the linear layer transforms the input tensor into an output tensor.
- Activations - Activation functions (ReLU, GELU, SiLU, Tanh, Softmax, etc.).
- Dropout - Dropout for regularisation.
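A short sketch of a linear layer followed by an activation and dropout, plus the query/key/value projection idea mentioned above (dimensions are arbitrary):

```python
import torch
from torch import nn

x = torch.rand(4, 32)                  # a batch of 4 vectors with 32 features

linear = nn.Linear(32, 64)             # learnable weight (64, 32) and bias (64,)
act = nn.GELU()
drop = nn.Dropout(p=0.1)

h = drop(act(linear(x)))               # (4, 64)
print(h.shape)

# The same projection idea is reused in attention: separate linear layers
# map the same input into query, key, and value spaces.
to_q, to_k, to_v = (nn.Linear(32, 64) for _ in range(3))
q, k, v = to_q(x), to_k(x), to_v(x)
```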
2.3 Embedding Layers
What you will learn:
- Embedding layers and their purpose in neural networks.
- Embedding layer implementation from scratch, initialisation, and usage.
- Positional encoding and how it is used to inject order into the model.
2.3 Lessons:
- Embedding Layers - Embedding layers and their purpose in neural networks. Input to embedding layer and how to interpret the output.
- Positional Encoding - Positional encoding and how it is used to inject order into the model.
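A sketch of an embedding lookup with a fixed sinusoidal positional encoding added on top (vocabulary size, model width, and sequence length are toy values):

```python
import math
import torch
from torch import nn

vocab_size, d_model, seq_len = 100, 16, 8
emb = nn.Embedding(vocab_size, d_model)

token_ids = torch.randint(0, vocab_size, (2, seq_len))   # (batch, seq_len) of integer ids
tok = emb(token_ids)                                      # (2, seq_len, d_model)

# Fixed sinusoidal positional encoding, added to the token embeddings.
pos = torch.arange(seq_len).unsqueeze(1)                                       # (seq_len, 1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))  # (d_model/2,)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

x = tok + pe          # broadcasting adds the same positions to every batch element
print(x.shape)        # (2, 8, 16)
```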
2.4 Normalisation Layers
What you will learn:
- BatchNorm vs. LayerNorm and when to use each.
- RMSNorm and other modern alternatives.
- Training vs. evaluation mode caveats.
2.4 Lessons:
- Normalization Layers - What normalisation layers do and why they are used. BatchNorm vs. LayerNorm and when to use each.
- RMS Norm - RMSNorm and other modern alternatives.
- Training Evaluation Mode - Training vs. evaluation mode caveats.
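A brief sketch contrasting LayerNorm, a hand-rolled RMS normalisation, and BatchNorm (nn.RMSNorm only exists in recent PyTorch releases, hence the manual version):

```python
import torch
from torch import nn

x = torch.rand(4, 10, 32)              # (batch, seq_len, features)

# LayerNorm normalises over the last dimension, independently per token.
ln = nn.LayerNorm(32)
print(ln(x).shape)                     # (4, 10, 32)

# RMS normalisation drops the mean-centering step; written by hand here
# in case your PyTorch version does not ship nn.RMSNorm.
rms = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)

# BatchNorm normalises over the batch dimension and behaves differently
# in train vs. eval mode, so remember model.train() / model.eval().
bn = nn.BatchNorm1d(32)
print(bn(torch.rand(4, 32)).shape)     # (4, 32)
```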
2.5 Loss Functions — Guiding Optimisation
What you will learn:
- A recap of loss functions: the main types and when to use each.
- Preparing inputs and targets for loss functions and interpreting outputs (logits vs. probabilities).
- Interpreting reduction modes and ignore indices.
2.5 Lessons:
- Loss Functions - A recap of loss functions: the main types and when to use each.
- Prepare Inputs Targets - Preparing inputs and targets for loss functions and interpreting outputs (logits vs. probabilities).
- Interpreting Reduction Modes - Interpreting reduction modes and ignore indices.
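A minimal loss-function sketch using CrossEntropyLoss with raw logits, reduction modes, and ignore_index:

```python
import torch
from torch import nn

# CrossEntropyLoss expects raw logits (no softmax) and integer class targets.
logits = torch.randn(4, 5)             # (batch, num_classes)
targets = torch.tensor([1, 0, 4, 2])   # (batch,)

loss_fn = nn.CrossEntropyLoss(reduction="mean", ignore_index=-100)
loss = loss_fn(logits, targets)
print(loss.item())

# reduction="none" returns one loss per example instead of the mean,
# and targets equal to ignore_index (here -100) are skipped entirely.
per_example = nn.CrossEntropyLoss(reduction="none")(logits, targets)
print(per_example.shape)               # torch.Size([4])
```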
Module 3: Training Workflows
Turn static graphs into learning machines by mastering data pipelines, loops, and monitoring tools.
3.1 The Training Loop
What you will learn:
- Anatomy of an epoch: forward → loss → backward → optimiser step.
- Gradient accumulation & clipping.
- Building a reusable training engine.
3.1 Lessons:
- Training Loop
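A sketch of a reusable training step, assuming model, loader, loss_fn, and optimizer are supplied by the caller (all names are placeholders):

```python
import torch

def train_one_epoch(model, loader, loss_fn, optimizer, device="cpu"):
    """One epoch: forward -> loss -> backward -> optimiser step."""
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()                 # clear gradients from the previous step
        outputs = model(inputs)               # forward pass
        loss = loss_fn(outputs, targets)      # compute the loss
        loss.backward()                       # backward pass: populate .grad

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # optional clipping
        optimizer.step()                      # update the weights
```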
3.2 Optimisers & Schedulers
What you will learn:
- SGD with momentum, Adam, and AdamW under the hood.
- Learning-rate scheduling strategies.
- Weight decay and regularisation.
3.2 Lessons:
- Optimizers Schedulers - todo
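A minimal optimiser-plus-scheduler sketch with AdamW and cosine annealing (the toy "loss" exists only to drive the loop):

```python
import torch
from torch import nn

model = nn.Linear(10, 2)

# AdamW decouples weight decay from the gradient update.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Cosine annealing decays the learning rate over T_max scheduler steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    loss = model(torch.rand(8, 10)).sum()   # placeholder "loss" just to drive updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                         # advance the learning-rate schedule
```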
3.3 Datasets & DataLoaders
What you will learn:
- Implementing custom Dataset subclasses.
- Batching, shuffling, and parallel loading with DataLoader.
- Data augmentation pipelines.
3.3 Lessons:
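A minimal custom Dataset and DataLoader sketch (ToyDataset is a made-up example with random data):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A map-style Dataset only needs __len__ and __getitem__."""

    def __init__(self, n: int = 100):
        self.x = torch.rand(n, 8)
        self.y = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True, num_workers=0)
for xb, yb in loader:
    print(xb.shape, yb.shape)   # torch.Size([16, 8]) torch.Size([16])
    break
```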
3.4 Accelerating with GPUs
What you will learn:
- Device discovery and placement.
- Moving models and data safely.
- Mixed-precision training best practices.
3.4 Lessons:
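A sketch of device placement with autocast and a gradient scaler for mixed precision; it falls back to full precision on CPU (names and sizes are illustrative):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.rand(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

# autocast runs eligible ops in float16; GradScaler guards against gradient underflow.
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```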
3.5 Weight Initialisation
What you will learn:
- Why initial values matter.
- Xavier (Glorot), Kaiming, and custom strategies.
3.5 Lessons:
- Weight Initialization - todo
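A short weight-initialisation sketch using Kaiming init applied via model.apply() (the init_weights helper is a made-up example):

```python
import torch
from torch import nn

def init_weights(module: nn.Module) -> None:
    """Kaiming init for linear layers feeding ReLU-like activations; zeros for biases."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.apply(init_weights)   # .apply() visits every submodule recursively
```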
Module 4: Deconstructing Transformer Architectures - main building blocks of transformers
From embeddings to multi-head attention, this module builds a Transformer from first principles and explores modern variants.
4.2 Various Ways of Injecting Order - Positional Encoding and Rotary Positional Embeddings (RoPE)
What you will learn:
- Absolute sinusoidal vs. learned embeddings.
- Relative positional encodings.
- Rotary Positional Embeddings (RoPE).
4.2 Lessons:
- Positional Embeddings
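A minimal sketch of the rotate-half formulation of RoPE applied to a toy query tensor (apply_rope is an illustrative helper, not a library function):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by a position-dependent angle (RoPE)."""
    b, seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half) / half)            # (half,)
    angles = torch.arange(seq_len).unsqueeze(1) * freqs        # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., :half], x[..., half:]                      # split features into two halves
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.rand(2, 8, 16)        # (batch, seq_len, head_dim)
print(apply_rope(q).shape)      # (2, 8, 16)
```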
4.3 Scaled Dot-Product Attention
What you will learn:
- Query, Key, Value formalism. How to interpret the output of the attention mechanism.
- Self-attention and cross-attention.
- Masking techniques.
4.3 Lessons:
- Attention Mechanism
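A compact sketch of scaled dot-product attention with an optional causal mask (the attention helper is written from scratch for illustration):

```python
import math
import torch

def attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k)) V with an optional mask blocking some positions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)                      # attention weights sum to 1 per query
    return weights @ v, weights

q = k = v = torch.rand(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))                     # lower-triangular causal mask
out, w = attention(q, k, v, mask=causal)
print(out.shape, w.shape)                                 # (2, 5, 16) (2, 5, 5)
```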
4.4 Multi-Head Attention
What you will learn:
- Motivation for multiple heads.
- Building MHA by projecting Q, K, V.
- Comparing a custom implementation with nn.MultiheadAttention.
4.4 Lessons:
- Multi-Head Attention
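A from-scratch multi-head attention sketch to compare against the built-in nn.MultiheadAttention (class name and sizes are illustrative):

```python
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    """Project to Q, K, V, split into heads, attend per head, then merge and project out."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, d_head)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        scores = (q @ k.transpose(-2, -1)) / self.d_head**0.5
        ctx = scores.softmax(dim=-1) @ v                   # (b, n_heads, t, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, t, d)         # merge heads back together
        return self.out(ctx)

mha = MultiHeadAttention(d_model=32, n_heads=4)
print(mha(torch.rand(2, 6, 32)).shape)    # (2, 6, 32)
# For comparison, the built-in layer: nn.MultiheadAttention(32, 4, batch_first=True)
```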
4.5 Other attention implementations
What you will learn:
- Accelerated attention with scaled dot-product attention (SDPA) - the PyTorch implementations available and how to choose between them.
- Flash Attention - a faster attention mechanism.
- Flash Attention 2 - a faster and more memory-efficient attention mechanism.
4.5 Lessons:
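A quick sketch of the fused F.scaled_dot_product_attention API, which dispatches to Flash Attention or the memory-efficient backend when the hardware, dtypes, and shapes allow it:

```python
import torch
import torch.nn.functional as F

q = torch.rand(2, 4, 8, 16)   # (batch, heads, seq_len, head_dim)
k = torch.rand(2, 4, 8, 16)
v = torch.rand(2, 4, 8, 16)

# PyTorch picks the fastest available backend (Flash Attention, memory-efficient,
# or the plain math implementation) behind this single call.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)              # (2, 4, 8, 16)
```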
4.6 Transformer Encoder
What you will learn:
- Residual connections & Layer Normalisation.
- Position-wise Feed-Forward Networks.
- Encoder-decoder attention.
4.6 Lessons:
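A sketch of a pre-norm encoder block combining self-attention, residual connections, LayerNorm, and a position-wise feed-forward network (EncoderBlock is an illustrative name):

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: attention and feed-forward, each with a residual connection."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # self-attention: Q = K = V
        x = x + attn_out                        # residual connection
        x = x + self.ff(self.norm2(x))          # position-wise feed-forward + residual
        return x

block = EncoderBlock(d_model=32, n_heads=4, d_ff=64)
print(block(torch.rand(2, 10, 32)).shape)       # (2, 10, 32)
```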
Module 5: Advanced PyTorch & Best Practices
5.1 PyTorch Hooks – Peeking Inside
What you will learn:
- How to use PyTorch hooks to inspect the inner workings of a model.
5.1 Lessons:
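A minimal forward-hook sketch that captures an intermediate activation (save_activation is a made-up helper):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
captured = {}

def save_activation(module, inputs, output):
    """Forward hooks receive the module, its inputs, and its output after each call."""
    captured[module] = output.detach()

# Attach a forward hook to the first linear layer, run the model, then clean up.
handle = model[0].register_forward_hook(save_activation)
model(torch.rand(2, 8))
handle.remove()

print(captured[model[0]].shape)   # torch.Size([2, 16])
```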
5.2 Distributed Training Concepts
What you will learn:
- How to use PyTorch distributed training to train a model on multiple GPUs.
5.2 Lessons:
5.3 Model Optimisation – Quantisation & Pruning
What you will learn:
- How to quantise and prune a model with PyTorch's model-optimisation tools.
5.3 Lessons:
5.4 TorchScript & JIT for Deployment
What you will learn:
- How to use TorchScript and the JIT compiler to deploy a model.
5.4 Lessons:
5.5 Profiling & Performance Tuning
What you will learn:
- How to use the PyTorch profiler to tune the performance of a model.
5.5 Lessons:
Module 6: Hugging Face Transformers in Practice
6.1 Loading Pre-trained Models with transformers
What you will learn:
- Loading pre-trained models with the transformers library.
- Using the AutoModel and Trainer APIs.
- Inference with pre-trained models.
6.1 Lessons:
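A minimal loading-and-inference sketch, assuming the transformers library is installed; distilbert-base-uncased is just an example checkpoint id from the Hugging Face Hub:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "distilbert-base-uncased"          # example checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("PyTorch makes tensors fun.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)        # (1, seq_len, hidden_size)
```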
6.2 Fine-tuning Pre-trained Models
What you will learn:
- Fine-tuning pre-trained models with both a manual training loop and the Trainer API.
- Using the AutoModelForSequenceClassification and AutoTokenizer APIs.