Dec 22, 2025 Sophont.med

Medmarks v0.1, a new LLM benchmark suite of medical tasks

Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Kunal Bagga, Ahmed Essouaied, Arya Hariharan, Sameed Khan, Anish Mahishi, Nishant Mishra, Manish Ram, Robert Scholz, Shamus Sim Zi Yang, Nikhil Khandekar, Geetu Ambwani, Maxime Griot, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul S. Scotti, & Tanishq Mathew Abraham

We present Medmarks v0.1, the first release of our new evaluation suite for assessing the medical capabilities of LLMs. It comes with two subsets: a verifiable set of tasks and an open-ended set of tasks. This suite is the largest completely open-source automated evaluation suite for medical capabilities, with a total of 20 benchmarks. It covers tasks ranging from answering patient questions to detecting errors in clinical notes. So far we’ve evaluated 46 models with 56 configurations. The best-performing models on our benchmark suite were GPT-5.1, GPT-5.2, and Qwen3 235B-A22B Thinking. We will continue to update our leaderboard with new models and tasks.

Nov 14, 2025 Sophont.med

How to Train a State-of-the-Art Pathology Foundation Model with $1.6k

Daniel Kaplan, Ratna Sagari Grandhi, Connor Lane, Benjamin Warner, Tanishq Mathew Abraham, & Paul S. Scotti

We present OpenMidnight, a replication and improvement of the Midnight pathology foundation model that achieves state-of-the-art performance across multiple benchmarks while being trained on just 12,000 whole-slide images for only $1.6k. We demonstrate that foundation models for computational pathology do not require massive scale to achieve top performance, and we release the full training pipeline, code, and model weights to accelerate research in this field.

Feb 10, 2025 Answer.AI

TIL: Masked Language Models Are Surprisingly Capable Zero-Shot Learners

Benjamin Clavié, Nathan Cooper, & Benjamin Warner

I have a [MASK] and I must classify: using masked language modeling for downstream tasks works surprisingly well.
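The trick, roughly, is to phrase classification as filling in a mask and comparing the scores of candidate label words. Below is a minimal sketch of the idea using the Hugging Face fill-mask pipeline; the checkpoint, prompt template, and label words are illustrative, not the post's exact setup.

```python
# Sketch: zero-shot classification via masked language modeling.
# Checkpoint, prompt, and label words are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

text = "The battery died after two days and the screen cracked immediately."
prompt = f"{text} Overall, this review is [MASK]."

# Map each class to a single-token label word in the model's vocabulary.
label_words = {"positive": "great", "negative": "terrible"}

# Score each candidate label word at the masked position and pick the best.
scores = {}
for label, word in label_words.items():
    results = fill_mask(prompt, targets=[word])
    scores[label] = results[0]["score"]

prediction = max(scores, key=scores.get)
print(prediction, scores)
```

The template and label words matter a lot in practice; the post explores how well this works across tasks when the prompt is chosen sensibly.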

Dec 19, 2024 HuggingFace.co

Finally, a Replacement for BERT

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Johno Whitaker, Jeremy Howard, & Iacopo Poli

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older-generation encoders across the board, with an 8,192-token sequence length, better downstream performance, and much faster processing. ModernBERT is available as a slot-in replacement for any BERT-like model, with both a base (149M params) and large (395M params) model size.
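Because ModernBERT uses the standard Hugging Face interfaces, swapping it in mostly amounts to changing the checkpoint name. A minimal sketch, assuming a recent transformers release with ModernBERT support and the released answerdotai/ModernBERT-base checkpoint:

```python
# Sketch: using ModernBERT as a drop-in masked language model via transformers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # released base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax()
print(tokenizer.decode(predicted_id))
```

Swapping in the large model is just a matter of pointing `model_id` at the large checkpoint; downstream fine-tuning code written for BERT-style encoders should work unchanged.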

Mar 14, 2024 Answer.AI

Enabling 70B Finetuning on Consumer GPUs
A Technical Deep Dive into FSDP+QLoRA

Benjamin Warner, Johno Whitaker, & Kerem Turgutlu

A detailed guide for adding FSDP and QLoRA support to quantization libraries and training frameworks.
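For a rough sense of what the end result looks like from the user's side, here is a hedged sketch using current transformers/peft parameter names; the post itself is about the library changes that make this possible, not this exact call sequence. The key idea is that the quantized weights are stored in a regular, shardable dtype so FSDP can split them across GPUs, while LoRA adapters hold the trainable parameters.

```python
# Sketch: QLoRA setup whose quantized weights FSDP can shard.
# Model id and hyperparameters are illustrative; a real 70B run loads the
# model on the meta device and wraps it with FSDP inside a training framework.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Store the packed 4-bit weights in bf16 so FSDP treats them like
    # ordinary parameters and can shard them across GPUs.
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
# The wrapped model is then passed to torch.distributed.fsdp.FullyShardedDataParallel,
# optionally with CPU offload, to fit 70B-scale finetuning on consumer GPUs.
```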

Aug 16, 2023

FlashAttention with PyTorch Compile
Benchmarking FlashAttention and FlashAttention-2 on a Consumer GPU

FlashAttention-2 builds on FlashAttention, yielding significant speedups on server-class GPUs. Unlike the PyTorch implementation of FlashAttention, FlashAttention-2 currently cannot be compiled into a single CUDA graph via PyTorch 2.0's torch.compile. Does this matter, and if so, at what model sizes and sequence lengths? In this post I attempt to answer these questions by benchmarking FlashAttention and FlashAttention-2 on a consumer GPU.
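A minimal sketch of the kind of measurement involved, using only PyTorch's built-in FlashAttention kernel (via scaled_dot_product_attention) under torch.compile; shapes and iteration counts are illustrative, and the FlashAttention-2 side of the comparison requires the separate flash-attn package.

```python
# Sketch: timing PyTorch's FlashAttention kernel, eager vs. torch.compile.
import torch
import torch.nn.functional as F

device = "cuda"
# (batch, heads, seq_len, head_dim) — illustrative shapes.
q, k, v = (torch.randn(8, 16, 2048, 64, device=device, dtype=torch.float16)
           for _ in range(3))

def sdpa(q, k, v):
    # With fp16 CUDA inputs, PyTorch dispatches to its FlashAttention kernel.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

compiled_sdpa = torch.compile(sdpa)

def time_fn(fn, iters=50):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(q, k, v)  # warmup (triggers compilation on the first compiled call)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print("eager:   ", time_fn(sdpa), "ms")
print("compiled:", time_fn(compiled_sdpa), "ms")
```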