Dec 19, 2024 HuggingFace.co

Finally, a Replacement for BERT

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Johno Whitaker, Jeremy Howard, & Iacopo Poli

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models that improve on older-generation encoders across the board, with an 8192 sequence length, better downstream performance, and much faster processing. ModernBERT is available as a slot-in replacement for any BERT-like model, in both base (139M params) and large (395M params) sizes.
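As a rough illustration of what "slot-in replacement" means in practice, a masked-language-modeling call through Hugging Face transformers might look like the sketch below. The `answerdotai/ModernBERT-base` checkpoint name and a recent transformers release are assumptions, not details from this summary.

```python
# Minimal sketch: using ModernBERT in place of a BERT checkpoint for masked LM.
# Assumes a recent transformers release and the answerdotai/ModernBERT-base checkpoint.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed hub id for the base (139M) model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```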

Mar 14, 2024 Answer.ai

Enabling 70B Finetuning on Consumer GPUs
A Technical Deep Dive into FSDP+QLoRA

Benjamin Warner, Johno Whitaker, & Kerem Turgutlu

A detailed guide for adding FSDP and QLoRA support to quantization libraries and training frameworks.
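One way the FSDP+QLoRA combination surfaces to end users is by storing the packed 4-bit weights in a floating-point dtype that FSDP can shard. The sketch below is a hedged illustration of that idea; the `bnb_4bit_quant_storage` option and the placeholder model id assume current transformers and bitsandbytes releases rather than anything stated in this summary.

```python
# Sketch: loading a 4-bit quantized model whose weights FSDP can shard.
# bnb_4bit_quant_storage is an assumption about recent transformers/bitsandbytes
# releases; older versions may not expose it.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Store the packed 4-bit weights as bf16 tensors so FSDP's flat
    # parameter sharding sees a dtype it can split across GPUs.
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
# LoRA adapters (e.g. via peft) would then be added before wrapping with FSDP.
```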

Aug 16, 2023

FlashAttention with PyTorch Compile
Benchmarking FlashAttention and FlashAttention-2 on a Consumer GPU

FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot compile into a single CUDA Graph via PyTorch 2.0's torch.compile. Does this matter, and if so, at what model sizes and sequence lengths? In this post I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.
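For context, the two paths being compared might look roughly like the sketch below: PyTorch's fused kernel, which torch.compile can capture, versus the external flash-attn package, which it cannot. The shapes and the flash-attn API are assumptions about the libraries at the time, not details taken from the post.

```python
# Sketch of the two attention paths being benchmarked.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # FlashAttention-2 package (assumed installed)

batch, heads, seq_len, head_dim = 4, 16, 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Path 1: PyTorch's built-in FlashAttention kernel, which torch.compile
# can capture into a single graph.
@torch.compile
def sdpa_attention(q, k, v):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Path 2: FlashAttention-2 via the external flash-attn package, which
# torch.compile treats as an opaque call (graph break).
def flash2_attention(q, k, v):
    # flash_attn_func expects (batch, seq_len, heads, head_dim)
    return flash_attn_func(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=True
    ).transpose(1, 2)
```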

Jul 28, 2023

Creating a Transformer From Scratch
Part Two: The Rest of the Transformer

In this post, I will show you how to build the rest of the Transformer. By the end, you will be familiar with all the pieces of a Transformer model and, combined with your knowledge of Attention, will be able to write an entire Transformer from scratch.
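To give a flavor of "the rest of the Transformer," a single pre-norm block with residual connections might look like the sketch below. It uses PyTorch's built-in MultiheadAttention as a stand-in for the Attention module the series builds by hand, and the dimensions are illustrative assumptions.

```python
# Sketch of one pre-norm Transformer block: attention and a feed-forward
# network, each wrapped in LayerNorm and a residual connection.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.mlp_norm(x))
        return x

x = torch.randn(2, 16, 256)         # (batch, sequence, embedding dim)
print(TransformerBlock()(x).shape)  # torch.Size([2, 16, 256])
```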
