Page 2 of 3 for All Writing

May 10, 2023

How to Quickly Finetune Your Transformer

While recent releases of language models have emphasized the large in Large Language Models, most everyday NLP work uses smaller language models, finetuned on custom or task specific datasets. In this post, I will show how to achieve fast finetuning performance on modern GPUs using tools like PyTorch 2.0’s torch.compile and FlashAttention.

Jan 20, 2023

Growing Cosine Unit Activation Function

Failing to Replicate CIFAR-10 Results

Last weekend the paper Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks by Noel et al surfaced on my social feed. This paper proposes a new oscillatory activation function, called Growing Cosine Unit (GCU), which is supposed to outperform other activation functions, such as SiLU, Mish, and ReLU. This immediately drew my attention and I decided to see if I could replicate the results.

Aug 31, 2022

Training Atari DQN Agents Three to Fourteen Times Faster

Using EnvPool and a PyTorch GPU Replay Memory Buffer

While working through Unit 3 of the Hugging Face Reinforcement Learning course, I was feeling impatient by how long it took for sugggested DQN configuration to finish training. I decided to investigate the lethargic performance and succeeded in increasing the training speed of Atari DQN agents by a factor of three to fourteen using EnvPool and a custom PyTorch GPU replay memory buffer.

Aug 7, 2022

Remixed Art History with Stable Diffusion

Famous Paintings by Different Artists

After tinkering around with Stable Diffusion for a bit, I recalled seeing a couple prompts of The Great Wave Off Kanagawa by Vincent van Gogh from Imagen and MidJourneyand wondered how Stable Diffusion would do at generating famous paintings by alternate artists. So I decided to give it a try and post some of the best results.

Jul 14, 2022

Tinkering With Attention Pooling

Improving Upon Learned Aggregation

In this post, I explain what Attention Pooling is and how it works. I experiment with Touvron et al’s Learned Aggregation on several small datasets and modestly improve upon Learned Aggregation’s results with a few tweaks. I experiment with hybrid pooling layers that combine Average and Attention Pooling and increase performance in the small dataset regime. However, all of these results still lag behind the performance of Average Pooling.

Jun 14, 2022

Discovering and Debugging a PyTorch Performance Decrease

Subclassed Tensors Reduce GPU Throughput up to Forty Percent

Over the past week, Thomas Capelle and I discovered, debugged, and created a workaround for a performance bug in PyTorch which reduced image training GPU throughput up to forty percent when using fastai. The culprit? Subclassed tensors.

Newer Posts Older Posts