Mixed Precision

Feb 10, 2025 Answer.AI

TIL: Masked Language Models Are Surprisingly Capable Zero-Shot Learners

Benjamin Clavié, Nathan Cooper, & Benjamin Warner

I have a [MASK] and I must classify: using masked language modeling for downstream tasks works surprisingly well.

Dec 19, 2024 HuggingFace.co

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Johno Whitaker, Jeremy Howard, & Iacopo Poli

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models that improve on older-generation encoders across the board, with an 8,192-token sequence length, better downstream performance, and much faster processing. ModernBERT is available as a slot-in replacement for any BERT-like model, in both base (149M params) and large (395M params) sizes.

Mar 14, 2024 Answer.AI

Enabling 70B Finetuning on Consumer GPUs

A Technical Deep Dive into FSDP+QLoRA

Benjamin Warner, Johno Whitaker, & Kerem Turgutlu

A detailed guide for adding FSDP and QLoRA support to quantization libraries and training frameworks.

Aug 16, 2023

FlashAttention with PyTorch Compile

Benchmarking FlashAttention and FlashAttention-2 on a Consumer GPU

FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot compile into a single CUDA graph via PyTorch 2.0's Compile. Does this matter, and if so, at what model sizes and sequence lengths? In this post I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.