Feb 10, 2025 Answer.AI
I have a [MASK] and I must classify: using masked language modeling for downstream tasks works surprisingly well.
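The core trick behind the post is prompting an encoder with a template containing a mask token and comparing the MLM head's logits for class-specific label words. A minimal sketch of that general idea (not the post's actual recipe), using `bert-base-uncased` as a stand-in encoder and a hypothetical two-word verbalizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "bert-base-uncased"  # stand-in encoder for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Hypothetical verbalizer: each class maps to one vocabulary token.
label_words = {"positive": "great", "negative": "terrible"}

review = "A beautiful, moving film."
inputs = tokenizer(f"{review} Overall it was [MASK].", return_tensors="pt")
mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_idx]

# Score each class by its label word's logit at the mask position.
scores = {label: logits[tokenizer.convert_tokens_to_ids(word)].item()
          for label, word in label_words.items()}
print(max(scores, key=scores.get))  # predicted class
```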
Dec 19, 2024 HuggingFace.co
This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models that improves on older-generation encoders across the board, with an 8,192-token sequence length, better downstream performance, and much faster processing. ModernBERT is available as a slot-in replacement for any BERT-like model, in both base (149M params) and large (395M params) sizes.
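Because it loads through the standard Auto* machinery, trying it is a two-line change; a quick sketch, assuming a transformers release recent enough to include ModernBERT support:

```python
from transformers import pipeline

# Existing BERT pipelines work unchanged after swapping the checkpoint name.
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])
```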
Mar 14, 2024 Answer.AI
A detailed guide for adding FSDP and QLoRA support to quantization libraries and training frameworks.
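The user-facing outcome of that work, roughly: 4-bit quantized weights get a storage dtype that FSDP can shard like any other parameter. A sketch of the kind of configuration this enables, assuming a transformers/bitsandbytes release that exposes `bnb_4bit_quant_storage` (the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# FSDP shards flat tensors of a single dtype, so the 4-bit weights'
# storage dtype must match the rest of the model's dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # the FSDP-enabling knob
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint, swap in your own
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```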
Aug 16, 2023
FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot be compiled into a single CUDA graph via PyTorch 2.0's torch.compile. Does this matter, and if so, at what model sizes and sequence lengths? In this post I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.
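The post benchmarks the standalone flash-attn library; as a rough illustration of the kind of measurement involved (not the post's code), here is a minimal timing harness comparing eager versus compiled attention through PyTorch's built-in SDPA, which dispatches to a FlashAttention kernel for fp16 inputs on supported GPUs:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim), fp16 on CUDA.
q, k, v = (torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# "reduce-overhead" mode captures the compiled region in a CUDA graph.
compiled_attn = torch.compile(attn, mode="reduce-overhead")

def bench(fn, iters=100):
    for _ in range(3):  # warmup (triggers compilation on the first call)
        fn(q, k, v)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(f"eager:    {bench(attn):.3f} ms")
print(f"compiled: {bench(compiled_attn):.3f} ms")
```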