Just Stir It Some More

A data science blog by Benjamin Warner

Aug 16, 2023

FlashAttention with PyTorch Compile
Benchmarking FlashAttention and FlashAttention-2 on a Consumer GPU

FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot compile into a single Cuda Graph via PyTorch 2.0's Compile. Does this matter, and if so at what model sizes and sequence lengths? In this post I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.

Jul 28, 2023

Creating a Transformer From Scratch
Part Two: The Rest of the Transformer

In this post, I will show you how to build the rest of the Transformer. By the end of this post, you will be familiar with all the pieces of a Transformer model and, combined with your knowledge of Attention, will be able to write an entire Transformer from scratch.

Jul 1, 2023

Creating a Transformer From Scratch
Part One: The Attention Mechanism

You cannot create a Transformer without Attention. In this post, I will show you how to write an Attention layer from scratch in PyTorch. By the end of this post, you will be familiar with all three flavors of Attention: Bidirectional, Causal, and Cross Attention, and should be able to write your own implementation of the Attention mechanism in code.