Mar 14, 2024

Enabling 70B Finetuning on Consumer GPUs
A Technical Deep Dive into FSDP+QLoRA

A detailed guide for adding FSDP and QLoRA support to quantization libraries and training frameworks.

Aug 16, 2023

FlashAttention with PyTorch Compile
Benchmarking FlashAttention and FlashAttention-2 on a Consumer GPU

FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot compile into a single CUDA Graph via PyTorch 2.0's torch.compile. Does this matter, and if so, at what model sizes and sequence lengths? In this post, I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.

Jul 28, 2023

Creating a Transformer From Scratch
Part Two: The Rest of the Transformer

In this post, I will show you how to build the rest of the Transformer. By the end of this post, you will be familiar with all the pieces of a Transformer model and, combined with your knowledge of Attention, will be able to write an entire Transformer from scratch.

Jul 1, 2023

Creating a Transformer From Scratch
Part One: The Attention Mechanism

You cannot create a Transformer without Attention. In this post, I will show you how to write an Attention layer from scratch in PyTorch. By the end of this post, you will be familiar with all three flavors of Attention: Bidirectional, Causal, and Cross Attention, and should be able to write your own implementation of the Attention mechanism in code.
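To give a flavor of what the post builds up to, here is a minimal sketch of single-head causal attention. It uses NumPy rather than PyTorch purely to stay self-contained; the function name and shapes are illustrative, not the post's actual code.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal (masked) scaled dot-product attention.

    q, k, v: arrays of shape (seq_len, head_dim).
    Returns an array of shape (seq_len, head_dim).
    """
    seq_len, head_dim = q.shape
    # Scaled dot-product scores, shape (seq_len, seq_len)
    scores = q @ k.T / np.sqrt(head_dim)
    # Causal mask: position i may only attend to positions <= i
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Dropping the mask gives bidirectional attention, and letting `k` and `v` come from a different sequence than `q` gives cross attention — the three flavors covered in the post.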

May 10, 2023

How to Quickly Finetune Your Transformer
Performance Tips for Faster Training

While recent releases of language models have emphasized the large in Large Language Models, most everyday NLP work uses smaller language models, finetuned on custom or task-specific datasets. In this post, I will show how to achieve fast finetuning performance on modern GPUs using tools like PyTorch 2.0's torch.compile and FlashAttention.

Jan 20, 2023

Growing Cosine Unit Activation Function
Failing to Replicate CIFAR-10 Results

Last weekend the paper Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks by Noel et al. surfaced on my social feed. This paper proposes a new oscillatory activation function, called Growing Cosine Unit (GCU), which is claimed to outperform other activation functions, such as SiLU, Mish, and ReLU. This immediately drew my attention, and I decided to see if I could replicate the results.
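For reference, the activation itself is simple: the paper defines GCU as f(x) = x · cos(x). A minimal scalar version (the replication in the post uses a PyTorch module, not this sketch) looks like:

```python
import math

def gcu(x: float) -> float:
    """Growing Cosine Unit activation: f(x) = x * cos(x).

    Unlike monotone activations such as ReLU, GCU oscillates,
    which is the property the paper credits for faster training.
    """
    return x * math.cos(x)
```
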