Benchmarking PyTorch’s Native Mish

For computer vision tasks, Mish (Diganta Misra. 2019. Mish: A self regularized non-monotonic neural activation function. arXiv:1908.08681) is my go-to activation function. When training new models or new layers during transfer learning, I have found that Mish performs on par with, if not better than, alternative activation functions. The code for this post can be found here.

The primary downside of using Mish has been performance. The PyTorch JIT versions have often been slower than other PyTorch native activation functions. While it was possible to match the performance of a native activation function, doing so required installing a CUDA implementation, which isn’t possible on every platform, such as Kaggle Notebooks.

When Diganta Misra announced that Mish would have a native implementation in PyTorch 1.9, I hoped it would match or exceed the computational performance of Thomas Brandon’s MishCuda implementation. After 1.9’s release, I decided to benchmark it to find out.

But first, some background on Mish and activation functions. (To skip directly to the benchmark, click here.)

# Activation Functions

Activation functions are a necessary part of neural networks: they add non-linearities to an otherwise linear network, which allows it to learn complex non-linear functions. Without activation functions, a neural network would collapse into a single linear transformation, and training it would be akin to fitting a linear regression.

Early on, neural networks used Sigmoid or TanH as their activation functions, but the field transitioned to the Rectified Linear Unit (ReLU) (Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, 315–323), which has remained the default activation function across many models and papers due to its simplicity and consistent performance across most tasks. (In some domains, such as tabular neural networks, ReLU is still one of the best activation functions, not just the default.)

Mathematically, ReLU is defined as:

$$f(x) = \max(0, x) \tag{1}$$

A plot of ReLU is shown in Figure 1 below.

Figure 1: Graphical representation of ReLU. From Towards Data Science.

ReLU has a few weaknesses which have spurred the hunt for better activation functions. Its discontinuous derivative at zero and its zeroing of all negative inputs create an output landscape with sharp transitions and a rough profile, which can cause undesired outcomes during gradient-based optimization.

Figure 2: Output landscape of a randomly initialized neural network with ReLU.

One possible negative outcome is the Dying ReLU problem, where the information loss from negative inputs being zeroed prevents gradient updates from flowing through the network during backpropagation, leading to stagnant, unchanged weights. In the worst case, this leaves a large portion of the model unable to learn further. Fortunately, a properly calibrated learning rate can prevent this problem from occurring.
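
As a toy illustration (my own, not from the post), here is how a layer can die under ReLU: if its bias is pushed far enough negative that every pre-activation is below zero, ReLU zeroes the outputs and no gradient reaches the weights.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration of the Dying ReLU problem: force a layer's bias
# far enough negative that every pre-activation is below zero.
layer = nn.Linear(16, 4)
with torch.no_grad():
    layer.bias.fill_(-100.0)          # knock the layer into the "dead" regime

x = torch.randn(32, 16)
out = torch.relu(layer(x))            # every output is zero
out.sum().backward()

print(out.abs().max())                # tensor(0.)
print(layer.weight.grad.abs().max())  # tensor(0.) -- no gradient, no recovery
```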

# Mish

Mish is a self-gated, smooth, continuously differentiable, and nonmonotonic activation function. Mish is defined as:

$$f(x) = x\tanh(\mathrm{softplus}(x)) \tag{2}$$

where $\mathrm{softplus}(x)$ (Qian Liu and Steve Furber. 2016. Noisy Softplus: A Biology Inspired Activation Function. In Neural Information Processing. DOI:10.1007/978-3-319-46681-1_49) is $\ln(1 + e^x)$. Mish is bounded below and unbounded above, with a range of approximately $[-0.31, \infty)$. Mish was inspired by SiLU (Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2017. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. arXiv:1702.03118), which shares many similar properties. (Misra cites Swish as the inspiration for Mish; however, since Swish is SiLU with a trainable β parameter that is usually set to 1, making it identical to SiLU, I will refer to Swish as SiLU.)
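
As a quick numerical check of equation (2) and the stated range, here is a minimal plain-PyTorch sketch (my own illustration, not code from the post):

```python
import torch
import torch.nn.functional as F

# Evaluate f(x) = x * tanh(softplus(x)) over a grid to confirm the bounds.
x = torch.linspace(-10, 10, steps=20001)
y = x * torch.tanh(F.softplus(x))

print(y.min())   # ≈ -0.3088: bounded below at roughly -0.31
print(y[-1])     # ≈ 10.0: roughly the identity for large positive x, unbounded above
```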

Figure 3: Graphical representation of Mish.

All of these properties combined—self-gated, smooth, continuously differentiable, unbounded above, bounded below, and nonmonotonic—give Mish advantages over other activation functions such as ReLU.

Inspired by the gate structures of Long Short-Term Memory (LSTM) networks (Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (November 1997), 1735–1780. DOI:10.1162/neco.1997.9.8.1735) and similar architectures, self-gating allows Mish to modulate its input by using the input itself as a gate, multiplying it with the output of the non-linear function applied to that input. It also allows Mish to be a drop-in replacement for non-gated activation functions, as Mish requires only a single input versus a traditional gate’s requirement of multiple inputs.
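
The distinction can be made concrete with a small sketch (my own illustration; `gated` and `self_gated` are hypothetical names): a traditional gate needs two tensors, while a self-gated activation derives its gate from the input itself.

```python
import torch
import torch.nn.functional as F

def gated(value: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    # Traditional gating (as in LSTM gates): one tensor modulates another.
    return value * torch.sigmoid(gate)

def self_gated(x: torch.Tensor) -> torch.Tensor:
    # Self-gating: the input plays both roles, so this is a drop-in
    # replacement for any single-input activation such as ReLU.
    return x * torch.tanh(F.softplus(x))   # this is exactly Mish
```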

Mish’s smooth profile assists in better gradient flow, which allows for easier updates during backpropagation. In addition to being smooth, Mish’s continuous differentiability means Mish avoids singularities in its derivative which can hinder gradient-based optimization.

Mish is unbounded above, which prevents saturation (the unnecessary slowing of training caused by near-zero gradients), a property it shares with ReLU. Mish is bounded below, which gives it strong regularization effects during training. But unlike ReLU, Mish allows a small amount of negative output to pass through, which eliminates the Dying ReLU problem and improves information propagation throughout the network. This gives Mish its nonmonotonic property and allows it to outperform monotonic activation functions in very deep networks.

Figure 4: Output landscape of a randomly initialized neural network with Mish.

Figure 4 shows how these properties combine to create an output landscape with smooth transitions, a stark visual contrast to ReLU’s landscape in Figure 2.

Since its introduction, Mish has been benchmarked on numerous vision tasks with far more wins than losses, including outperforming SiLU in 75% of those tasks. Mish has been used in models such as YOLOv4 (Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934), which is currently state of the art for real-time object detection on MSCOCO; it was found to have adversarial robustness similar to GELU (Dan Hendrycks and Kevin Gimpel. 2020. Gaussian Error Linear Units (GELUs). arXiv:1606.08415); and the Mish paper was accepted at the prestigious British Machine Vision Conference 2020. Mish was also used by the fast.ai community to achieve state of the art accuracy on the Imagenette leaderboard, which is where I first learned about it.

# Benchmarking

To benchmark PyTorch native Mish’s performance against other implementations and activation functions, I tested ReLU, Leaky ReLU (Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing), Softplus, SiLU, and Mish via the following implementations:

  • ReLU: PyTorch’s native implementation
  • Leaky ReLU: PyTorch’s native implementation
  • Softplus: PyTorch’s native implementation
  • SiLU (jit): fast.ai’s TorchScript implementation
  • SiLU (native): PyTorch’s native implementation
  • Mish (naive): a plain PyTorch implementation (sketched after this list)
  • Mish (jit): fast.ai’s TorchScript implementation
  • Mish (cuda): Thomas Brandon’s CUDA implementation
  • Mish (native): PyTorch’s native implementation
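
Rough sketches of the Mish variants are shown below (my own simplifications, not the exact code benchmarked; the fast.ai TorchScript version also scripts a custom backward pass to save memory, which is omitted here):

```python
import torch
import torch.nn.functional as F

def mish_naive(x: torch.Tensor) -> torch.Tensor:
    # "Mish (naive)": equation (2) evaluated eagerly in plain PyTorch.
    return x * torch.tanh(F.softplus(x))

@torch.jit.script
def mish_jit(x: torch.Tensor) -> torch.Tensor:
    # "Mish (jit)": the same formula compiled with TorchScript.
    return x * torch.tanh(F.softplus(x))

mish_native = F.mish   # "Mish (native)": added in PyTorch 1.9 (also nn.Mish)
# "Mish (cuda)" comes from Thomas Brandon's separate mish-cuda package.
```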

To measure the computational performance of all nine activation function implementations on Google Colab using their Tesla V100, Tesla P100, and CPU instances, I lightly modified Thomas Brandon’s CUDA benchmarking script. The script uses a rank four tensor of size (16,10,256,256) and records 100 forward and backward passes after 10 warmup runs. (I also benchmarked a rank four tensor of size (64,10,256,256), but have not reproduced those results here as they tell the same story.) My version adds bfloat16 to the existing float16, float32, and float64 tensor types, and adds a CPU benchmark where I only measured float32 performance. I also ran a PyTorch 1.8 test on a Tesla V100 to see whether the performance of the other activation function implementations has changed across PyTorch versions.
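
The actual benchmarking script isn’t reproduced here, but the timing approach looks roughly like the sketch below, which assumes CUDA events for GPU timing; `time_activation` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def time_activation(fn, shape=(16, 10, 256, 256), dtype=torch.float16,
                    device="cuda", warmup=10, runs=100):
    """Time `runs` forward and backward passes of `fn` after `warmup` untimed
    passes. Returns per-run timings in milliseconds."""
    x = torch.randn(*shape, dtype=dtype, device=device, requires_grad=True)
    grad_out = torch.ones_like(x)          # fixed upstream gradient for backward
    fwd, bwd = [], []
    for i in range(warmup + runs):
        start = torch.cuda.Event(enable_timing=True)
        mid = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        y = fn(x)
        mid.record()
        y.backward(grad_out)
        end.record()
        torch.cuda.synchronize()
        if i >= warmup:                    # discard the warmup iterations
            fwd.append(start.elapsed_time(mid))
            bwd.append(mid.elapsed_time(end))
        x.grad = None                      # reset between iterations
    return fwd, bwd

fwd_ms, bwd_ms = time_activation(F.mish)   # native Mish, PyTorch 1.9+
```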

All results are using CUDA 10.2.

Code and raw results for all benchmarks are available here.

# V100

On a Tesla V100, PyTorch native Mish’s performance is competitive with all other activation functions and implementations, if not the outright leader. For example, during Mixed Precision training Mish should have the fastest forward pass and should only be outperformed during the backward pass by SiLU. On Volta hardware native Mish is a clear winner over other Mish implementations, and I would be surprised if this lead did not hold on modern Ampere hardware too.

Given the complexity differences, one surprising result is the performance of Mish and SiLU relative to ReLU. With the exception of float64, native Mish is faster than native ReLU, and native SiLU is always faster than native ReLU.

Activation Functions’ Forward and Backward Pass on a Tesla V100-SXM2-16GB using PyTorch 1.9.

| Tensor Type | Function | Forward Mean | Forward StDev | Forward Min | Forward Max | Backward Mean | Backward StDev | Backward Min | Backward Max |
|---|---|---|---|---|---|---|---|---|---|
| float16 | ReLU | 100.4µs | 4.316µs | 96.26µs | 116.7µs | 179.6µs | 16.79µs | 161.8µs | 301.1µs |
| | Leaky ReLU | 95.32µs | 3.966µs | 92.16µs | 120.8µs | 178.8µs | 28.97µs | 158.7µs | 323.6µs |
| | Softplus | 105.4µs | 7.069µs | 100.4µs | 151.6µs | 237.3µs | 378.3µs | 171.0µs | 3.813ms |
| | SiLU (jit) | 114.3µs | 62.76µs | 99.33µs | 638.0µs | 249.7µs | 288.3µs | 169.0µs | 2.765ms |
| | SiLU (native) | 86.78µs | 2.632µs | 83.97µs | 100.4µs | 158.8µs | 12.15µs | 147.5µs | 196.6µs |
| | Mish (naive) | 239.4µs | 7.391µs | 232.4µs | 272.4µs | 502.3µs | 341.5µs | 463.9µs | 3.900ms |
| | Mish (jit) | 187.0µs | 4.837µs | 182.3µs | 204.8µs | 254.5µs | 9.229µs | 248.8µs | 328.7µs |
| | Mish (cuda) | 116.1µs | 5.289µs | 107.5µs | 136.2µs | 185.1µs | 21.07µs | 169.0µs | 291.8µs |
| | Mish (native) | 84.57µs | 3.283µs | 81.92µs | 101.4µs | 168.2µs | 16.51µs | 155.6µs | 271.4µs |
| bfloat16 | ReLU | 94.56µs | 65.80µs | 83.97µs | 747.5µs | 197.6µs | 246.0µs | 143.4µs | 1.988ms |
| | Leaky ReLU | 84.14µs | 5.659µs | 80.90µs | 116.7µs | 155.2µs | 13.22µs | 143.4µs | 199.7µs |
| | Softplus | 97.58µs | 3.035µs | 95.23µs | 109.6µs | 152.3µs | 13.03µs | 146.4µs | 261.1µs |
| | SiLU (jit) | 178.3µs | 7.874µs | 172.0µs | 213.0µs | 598.2µs | 412.6µs | 542.7µs | 4.677ms |
| | SiLU (native) | 84.90µs | 4.708µs | 80.90µs | 113.7µs | 187.3µs | 162.1µs | 144.4µs | 1.535ms |
| | Mish (naive) | 242.1µs | 5.280µs | 236.5µs | 262.1µs | 472.0µs | 70.70µs | 458.8µs | 1.011ms |
| | Mish (jit) | 258.8µs | 39.72µs | 247.8µs | 647.2µs | 796.5µs | 265.8µs | 758.8µs | 3.290ms |
| | Mish (cuda) | 258.8µs | 39.72µs | 247.8µs | 647.2µs | 796.5µs | 265.8µs | 758.8µs | 3.290ms |
| | Mish (native) | 87.34µs | 5.521µs | 82.94µs | 119.8µs | 170.9µs | 21.28µs | 154.6µs | 332.8µs |
| float32 | ReLU | 132.7µs | 5.097µs | 128.0µs | 149.5µs | 235.3µs | 124.5µs | 216.1µs | 1.331ms |
| | Leaky ReLU | 134.6µs | 53.66µs | 123.9µs | 663.6µs | 291.6µs | 416.3µs | 216.1µs | 3.420ms |
| | Softplus | 127.1µs | 3.866µs | 123.9µs | 144.4µs | 219.1µs | 3.852µs | 217.1µs | 254.0µs |
| | SiLU (jit) | 305.7µs | 101.5µs | 288.8µs | 1.312ms | 974.2µs | 201.2µs | 932.9µs | 2.338ms |
| | SiLU (native) | 127.4µs | 4.018µs | 122.9µs | 144.4µs | 217.9µs | 754.2ns | 216.1µs | 220.2µs |
| | Mish (naive) | 396.4µs | 71.03µs | 382.0µs | 1.080ms | 912.2µs | 748.6µs | 819.2µs | 8.176ms |
| | Mish (jit) | 408.6µs | 40.85µs | 395.3µs | 805.9µs | 1.410ms | 846.5µs | 1.294ms | 9.495ms |
| | Mish (cuda) | 141.0µs | 4.121µs | 135.2µs | 158.7µs | 235.6µs | 77.05µs | 222.2µs | 969.7µs |
| | Mish (native) | 128.3µs | 2.898µs | 124.9µs | 139.3µs | 220.5µs | 952.9ns | 219.1µs | 224.3µs |
| float64 | ReLU | 324.0µs | 4.909µs | 318.5µs | 341.0µs | 492.1µs | 170.1µs | 470.0µs | 2.169ms |
| | Leaky ReLU | 319.6µs | 4.031µs | 314.4µs | 334.8µs | 472.4µs | 2.247µs | 469.0µs | 485.4µs |
| | Softplus | 312.7µs | 4.179µs | 307.2µs | 329.7µs | 470.3µs | 1.141µs | 468.0µs | 473.1µs |
| | SiLU (jit) | 722.5µs | 9.100µs | 712.7µs | 756.7µs | 2.447ms | 172.7µs | 2.416ms | 4.107ms |
| | SiLU (native) | 316.3µs | 5.669µs | 310.3µs | 348.2µs | 473.1µs | 1.782µs | 470.0µs | 482.3µs |
| | Mish (naive) | 995.5µs | 18.37µs | 985.1µs | 1.165ms | 1.999ms | 79.57µs | 1.979ms | 2.789ms |
| | Mish (jit) | 1.009ms | 10.14µs | 997.4µs | 1.050ms | 3.405ms | 178.1µs | 3.374ms | 5.177ms |
| | Mish (cuda) | 344.1µs | 5.212µs | 340.0µs | 371.7µs | 470.1µs | 1.002µs | 469.0µs | 476.2µs |
| | Mish (native) | 334.5µs | 4.198µs | 328.7µs | 348.2µs | 479.5µs | 1.436µs | 478.2µs | 489.5µs |

Lower is better. Fastest pass and those statistically indistinguishable at the p=0.001 level are highlighted.

# P100

Unlike the Tesla V100, the Tesla P100 had wide variance in computational performance for many activation functions. The large variance for the bfloat16, float32, and float64 backward pass caused the computational time of many of these activation functions to be statistically indistinguishable from each other at the p=0.001 level. Changing the number of timed iterations from 100 to 50 & 25 and warmup runs from 10 to 5 did not materially change the variance of the tests.
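
The tables flag results that are statistically indistinguishable at the p=0.001 level. The exact test isn’t spelled out in this post, but a Welch’s t-test over the per-run timings is one way such a comparison could be made; the sketch below uses synthetic numbers loosely based on the float16 rows of the P100 table below, not the real recorded samples.

```python
import numpy as np
from scipy import stats

# Illustrative only: synthetic per-run forward-pass timings (µs), with means and
# standard deviations taken from the P100 float16 Mish rows.
rng = np.random.default_rng(0)
mish_native = rng.normal(206.1, 44.46, size=100)
mish_cuda = rng.normal(277.4, 3.832, size=100)

# Welch's t-test (unequal variances); p < 0.001 -> treat as distinguishable.
t, p = stats.ttest_ind(mish_native, mish_cuda, equal_var=False)
print(p < 0.001)
```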

Focusing on float16 where there is less variance, PyTorch’s native Mish had a significant improvement over other Mish implementations but its performance still lags most of the other PyTorch native activation functions, with the notable exception of Softplus, which brings up the rear.

Despite these results, native Mish is either more performant or statistically indistinguishable from other Mish implementations.

Activation Functions’ Forward and Backward Pass on a Tesla P100-PCIE-16GB using PyTorch 1.9.

| Tensor Type | Function | Forward Mean | Forward StDev | Forward Min | Forward Max | Backward Mean | Backward StDev | Backward Min | Backward Max |
|---|---|---|---|---|---|---|---|---|---|
| float16 | ReLU | 114.3µs | 15.35µs | 106.2µs | 236.0µs | 401.8µs | 502.5µs | 327.6µs | 5.170ms |
| | Leaky ReLU | 104.9µs | 15.58µs | 101.1µs | 257.4µs | 316.7µs | 15.29µs | 314.0µs | 468.8µs |
| | Softplus | 225.3µs | 61.01µs | 214.5µs | 831.2µs | 449.3µs | 65.98µs | 436.4µs | 1.091ms |
| | SiLU (jit) | 181.8µs | 21.29µs | 177.0µs | 363.1µs | 364.7µs | 769.8ns | 362.7µs | 366.6µs |
| | SiLU (native) | 128.5µs | 4.818µs | 124.5µs | 148.4µs | 333.0µs | 183.8µs | 309.5µs | 2.113ms |
| | Mish (naive) | 434.0µs | 25.70µs | 428.3µs | 686.8µs | 932.5µs | 213.1µs | 900.4µs | 2.707ms |
| | Mish (jit) | 378.0µs | 7.118µs | 368.3µs | 400.0µs | 625.6µs | 307.2µs | 590.4µs | 3.681ms |
| | Mish (cuda) | 277.4µs | 3.832µs | 273.4µs | 305.4µs | 478.6µs | 467.8ns | 477.8µs | 480.2µs |
| | Mish (native) | 206.1µs | 44.46µs | 197.5µs | 644.7µs | 446.4µs | 26.33µs | 441.5µs | 678.7µs |
| bfloat16 | ReLU | 107.4µs | 2.817µs | 105.2µs | 119.2µs | 307.4µs | 5.838µs | 305.6µs | 359.3µs |
| | Leaky ReLU | 103.2µs | 3.106µs | 100.8µs | 122.3µs | 307.4µs | 2.372µs | 306.5µs | 330.7µs |
| | Softplus | 240.4µs | 32.73µs | 235.0µs | 563.8µs | 429.4µs | 328.5µs | 395.8µs | 3.698ms |
| | SiLU (jit) | 273.8µs | 6.896µs | 268.6µs | 311.7µs | 909.9µs | 114.8µs | 896.0µs | 2.052ms |
| | SiLU (native) | 143.6µs | 5.667µs | 138.5µs | 175.3µs | 360.0µs | 235.9µs | 314.3µs | 2.393ms |
| | Mish (naive) | 463.9µs | 61.92µs | 454.0µs | 1.079ms | 884.3µs | 212.4µs | 861.4µs | 2.998ms |
| | Mish (jit) | 472.1µs | 5.339µs | 467.2µs | 490.6µs | 1.371ms | 283.2µs | 1.329ms | 3.923ms |
| | Mish (cuda) | 472.1µs | 5.339µs | 467.2µs | 490.6µs | 1.371ms | 283.2µs | 1.329ms | 3.923ms |
| | Mish (native) | 218.0µs | 12.73µs | 212.5µs | 338.4µs | 511.6µs | 231.6µs | 453.4µs | 1.670ms |
| float32 | ReLU | 189.9µs | 47.20µs | 182.3µs | 658.5µs | 468.4µs | 258.1µs | 441.3µs | 3.037ms |
| | Leaky ReLU | 185.1µs | 40.48µs | 177.8µs | 586.2µs | 471.9µs | 223.4µs | 441.4µs | 2.584ms |
| | Softplus | 208.0µs | 47.14µs | 197.9µs | 674.7µs | 465.0µs | 215.5µs | 442.1µs | 2.609ms |
| | SiLU (jit) | 433.0µs | 27.39µs | 423.3µs | 698.0µs | 1.574ms | 286.1µs | 1.525ms | 4.144ms |
| | SiLU (native) | 181.3µs | 6.474µs | 176.3µs | 224.7µs | 447.2µs | 40.24µs | 441.3µs | 847.6µs |
| | Mish (naive) | 595.9µs | 19.62µs | 586.6µs | 773.0µs | 1.365ms | 69.41µs | 1.352ms | 1.946ms |
| | Mish (jit) | 605.0µs | 6.587µs | 599.2µs | 641.4µs | 2.148ms | 401.9µs | 2.090ms | 5.812ms |
| | Mish (cuda) | 283.4µs | 4.760µs | 278.7µs | 300.0µs | 490.5µs | 21.11µs | 485.7µs | 700.1µs |
| | Mish (native) | 208.9µs | 40.96µs | 199.9µs | 610.7µs | 497.6µs | 218.6µs | 466.3µs | 2.442ms |
| float64 | ReLU | 392.7µs | 4.992µs | 386.7µs | 415.8µs | 721.7µs | 204.1µs | 694.0µs | 2.732ms |
| | Leaky ReLU | 390.2µs | 8.284µs | 382.8µs | 447.1µs | 707.8µs | 79.32µs | 694.6µs | 1.455ms |
| | Softplus | 489.0µs | 16.10µs | 482.1µs | 636.3µs | 813.3µs | 180.5µs | 783.9µs | 2.473ms |
| | SiLU (jit) | 1.006ms | 6.768µs | 995.1µs | 1.026ms | 3.258ms | 112.6µs | 3.239ms | 4.378ms |
| | SiLU (native) | 382.9µs | 7.561µs | 377.0µs | 435.7µs | 692.6µs | 2.279µs | 688.0µs | 709.7µs |
| | Mish (naive) | 1.343ms | 8.757µs | 1.332ms | 1.403ms | 2.828ms | 390.0µs | 2.766ms | 5.945ms |
| | Mish (jit) | 1.355ms | 6.540µs | 1.346ms | 1.385ms | 4.575ms | 212.3µs | 4.541ms | 6.687ms |
| | Mish (cuda) | 717.0µs | 4.584µs | 713.2µs | 746.3µs | 976.7µs | 611.9ns | 975.2µs | 979.6µs |
| | Mish (native) | 645.9µs | 5.989µs | 639.8µs | 669.6µs | 976.7µs | 201.0µs | 943.0µs | 2.408ms |

Lower is better. Fastest pass and those statistically indistinguishable at the p=0.001 level are highlighted.

# CPU

The CPUs which Google Colab serves on their instances vary from generation to generation. I was able to run the CPU benchmark on Haswell, Broadwell, and Skylake Xeon CPUs. (Intel has multiple product generations, Skylake, Cascade Lake, and Cooper Lake, which all share the same Family 6 Model 85 designation, but based on the cache size I believe I benchmarked on a Skylake.) You shouldn’t train neural networks on CPU, but just in case, I timed the backward pass in addition to the forward pass used to benchmark inference. There were not any major differences between CPU generations, other than older CPUs being slower, so I only reproduce the Haswell and Skylake results.

Unsurprisingly, given its simplicity, ReLU has the fastest forward pass across all tested CPU generations. The surprising result is that PyTorch native Mish is significantly slower than all other activation functions, a result replicated across all tested CPU generations. A priori, I would have expected it to perform on par with native SiLU, but Mish is an order of magnitude slower. The other intriguing result is that a backward pass using Softplus was faster than a forward pass. This oddity persisted across multiple runs on all CPU generations.

When using a CPU for inference, it would be best to replace the native Mish implementation with the TorchScript version, at least until PyTorch improves its CPU optimizations for native Mish.
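
A minimal sketch of that swap (my own illustration; `MishJit` and `swap_mish_for_cpu` are hypothetical names, and the scripted function omits fast.ai’s memory-saving custom backward):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.jit.script
def _mish_jit(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))

class MishJit(nn.Module):
    """TorchScript-backed Mish, used here only for CPU inference."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return _mish_jit(x)

def swap_mish_for_cpu(model: nn.Module) -> nn.Module:
    # Recursively replace every nn.Mish module with the TorchScript version.
    for name, child in model.named_children():
        if isinstance(child, nn.Mish):
            setattr(model, name, MishJit())
        else:
            swap_mish_for_cpu(child)
    return model
```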

Activation Functions’ Forward and Backward Pass on a Haswell and Skylake Xeon using PyTorch 1.9.

| CPU | Function | Forward Mean | Forward StDev | Forward Min | Forward Max | Backward Mean | Backward StDev | Backward Min | Backward Max |
|---|---|---|---|---|---|---|---|---|---|
| Haswell | ReLU | 19.95ms | 618.7µs | 18.61ms | 23.66ms | 44.59ms | 1.026ms | 42.70ms | 49.14ms |
| | Leaky ReLU | 20.60ms | 850.2µs | 17.80ms | 25.07ms | 45.02ms | 1.071ms | 42.83ms | 48.56ms |
| | Softplus | 61.69ms | 1.328ms | 59.94ms | 65.84ms | 49.82ms | 1.341ms | 48.06ms | 56.02ms |
| | SiLU (jit) | 47.55ms | 2.001ms | 44.50ms | 53.54ms | 168.6ms | 2.993ms | 162.4ms | 181.5ms |
| | SiLU (native) | 24.50ms | 1.224ms | 22.62ms | 30.06ms | 50.89ms | 1.382ms | 48.49ms | 58.31ms |
| | Mish (naive) | 123.6ms | 2.575ms | 120.1ms | 132.6ms | 143.0ms | 2.669ms | 138.1ms | 154.2ms |
| | Mish (jit) | 123.0ms | 2.387ms | 120.1ms | 132.0ms | 292.3ms | 3.622ms | 284.6ms | 301.6ms |
| | Mish (native) | 144.5ms | 1.294ms | 142.3ms | 149.8ms | 183.0ms | 2.418ms | 178.8ms | 192.1ms |
| Skylake | ReLU | 9.791ms | 541.3µs | 9.225ms | 13.19ms | 26.25ms | 808.5µs | 25.39ms | 29.88ms |
| | Leaky ReLU | 10.69ms | 664.5µs | 9.579ms | 13.68ms | 28.27ms | 1.210ms | 26.49ms | 31.89ms |
| | Softplus | 48.18ms | 2.954ms | 43.05ms | 53.93ms | 32.99ms | 2.090ms | 30.05ms | 40.69ms |
| | SiLU (jit) | 27.55ms | 1.636ms | 25.81ms | 34.08ms | 95.21ms | 2.974ms | 91.30ms | 105.0ms |
| | SiLU (native) | 13.83ms | 687.5µs | 13.23ms | 17.85ms | 30.66ms | 1.096ms | 29.63ms | 35.38ms |
| | Mish (naive) | 74.72ms | 5.135ms | 68.01ms | 85.55ms | 84.82ms | 3.569ms | 78.94ms | 94.52ms |
| | Mish (jit) | 73.17ms | 4.608ms | 67.65ms | 85.18ms | 170.3ms | 7.535ms | 160.6ms | 189.9ms |
| | Mish (native) | 113.9ms | 7.488ms | 103.4ms | 129.7ms | 142.7ms | 9.413ms | 130.2ms | 164.6ms |

Lower is better. All results for float32. Fastest pass and those statistically indistinguishable at the p=0.001 level are highlighted.

# PyTorch 1.8 V100

PyTorch did not have a native implementation of Mish in version 1.8, so this test shows how much activation function performance varies from one PyTorch version to the next. Most PyTorch 1.8 activation functions are either statistically indistinguishable from their PyTorch 1.9 counterparts at the p=0.001 level or are slower than them. Native SiLU’s forward pass showed a consistent, noticeable improvement in speed from PyTorch 1.8 to PyTorch 1.9.

One notable exception is the forward pass of native ReLU, which had a statistically significant decrease in performance for both float16 and float32 from PyTorch 1.8 to PyTorch 1.9.

Activation Functions’ Forward and Backward Pass on a Tesla V100-SXM2-16GB using PyTorch 1.8.

| Tensor Type | Function | Forward Mean | Forward StDev | Forward Min | Forward Max | Backward Mean | Backward StDev | Backward Min | Backward Max |
|---|---|---|---|---|---|---|---|---|---|
| float16 | ReLU | 96.16µs | 2.513µs | 94.21µs | 107.5µs | 174.2µs | 10.19µs | 159.7µs | 217.1µs |
| | Leaky ReLU | 99.91µs | 23.17µs | 88.06µs | 293.9µs | 248.1µs | 421.7µs | 155.6µs | 3.763ms |
| | Softplus | 99.59µs | 3.792µs | 96.26µs | 120.8µs | 192.7µs | 179.1µs | 158.7µs | 1.964ms |
| | SiLU (jit) | 110.8µs | 4.060µs | 106.5µs | 126.0µs | 208.9µs | 21.59µs | 186.4µs | 347.1µs |
| | SiLU (native) | 94.23µs | 4.252µs | 91.14µs | 111.6µs | 173.7µs | 21.12µs | 154.6µs | 303.1µs |
| | Mish (naive) | 241.1µs | 5.324µs | 235.5µs | 267.3µs | 504.7µs | 359.3µs | 463.9µs | 4.080ms |
| | Mish (jit) | 195.0µs | 6.865µs | 189.4µs | 244.7µs | 261.5µs | 9.261µs | 254.0µs | 294.9µs |
| | Mish (cuda) | 118.8µs | 4.640µs | 115.7µs | 149.5µs | 205.0µs | 75.83µs | 173.1µs | 840.7µs |
| bfloat16 | ReLU | 85.61µs | 3.666µs | 83.97µs | 102.4µs | 171.5µs | 13.31µs | 149.5µs | 206.8µs |
| | Leaky ReLU | 91.97µs | 66.68µs | 79.87µs | 754.7µs | 209.2µs | 268.3µs | 151.6µs | 2.597ms |
| | Softplus | 99.76µs | 1.696µs | 98.30µs | 106.5µs | 159.5µs | 16.70µs | 146.4µs | 275.5µs |
| | SiLU (jit) | 182.8µs | 5.078µs | 177.2µs | 205.8µs | 568.9µs | 122.0µs | 541.7µs | 1.560ms |
| | SiLU (native) | 88.71µs | 2.927µs | 86.02µs | 103.4µs | 171.5µs | 13.83µs | 147.5µs | 228.4µs |
| | Mish (naive) | 243.1µs | 4.139µs | 239.6µs | 260.1µs | 482.6µs | 177.1µs | 458.8µs | 2.214ms |
| | Mish (jit) | 259.7µs | 33.09µs | 252.9µs | 584.7µs | 781.3µs | 191.0µs | 759.8µs | 2.681ms |
| | Mish (cuda) | 259.7µs | 33.09µs | 252.9µs | 584.7µs | 781.3µs | 191.0µs | 759.8µs | 2.681ms |
| float32 | ReLU | 129.4µs | 3.984µs | 126.0µs | 146.4µs | 220.9µs | 25.02µs | 216.1µs | 448.5µs |
| | Leaky ReLU | 130.7µs | 4.009µs | 128.0µs | 148.5µs | 234.1µs | 158.8µs | 216.1µs | 1.812ms |
| | Softplus | 138.5µs | 78.90µs | 128.0µs | 922.6µs | 238.2µs | 197.8µs | 217.1µs | 2.207ms |
| | SiLU (jit) | 299.8µs | 7.123µs | 294.9µs | 341.0µs | 958.0µs | 127.0µs | 932.9µs | 1.993ms |
| | SiLU (native) | 133.1µs | 3.747µs | 124.9µs | 146.4µs | 217.9µs | 1.215µs | 217.1µs | 228.4µs |
| | Mish (naive) | 398.7µs | 55.39µs | 386.0µs | 938.0µs | 856.2µs | 243.4µs | 819.2µs | 2.613ms |
| | Mish (jit) | 407.7µs | 28.66µs | 399.4µs | 641.0µs | 1.323ms | 184.6µs | 1.295ms | 3.063ms |
| | Mish (cuda) | 142.3µs | 5.070µs | 138.2µs | 164.9µs | 281.1µs | 557.4µs | 222.2µs | 5.826ms |
| float64 | ReLU | 326.7µs | 28.28µs | 315.4µs | 566.3µs | 497.3µs | 189.2µs | 470.0µs | 2.343ms |
| | Leaky ReLU | 321.8µs | 6.283µs | 315.4µs | 349.2µs | 472.0µs | 1.451µs | 469.0µs | 479.2µs |
| | Softplus | 316.9µs | 4.720µs | 310.3µs | 335.9µs | 469.5µs | 1.753µs | 466.9µs | 478.2µs |
| | SiLU (jit) | 727.0µs | 10.50µs | 715.8µs | 786.4µs | 2.481ms | 302.3µs | 2.412ms | 4.910ms |
| | SiLU (native) | 322.0µs | 15.89µs | 316.4µs | 475.1µs | 472.8µs | 1.526µs | 469.0µs | 477.2µs |
| | Mish (naive) | 997.1µs | 18.65µs | 987.1µs | 1.176ms | 1.990ms | 3.833µs | 1.983ms | 2.002ms |
| | Mish (jit) | 1.010ms | 7.377µs | 1.000ms | 1.036ms | 3.421ms | 229.3µs | 3.378ms | 5.051ms |
| | Mish (cuda) | 346.5µs | 4.061µs | 343.0µs | 366.6µs | 471.2µs | 827.5ns | 470.0µs | 473.1µs |

Lower is better. Bolded passes are faster than PyTorch 1.9 and italic passes are slower than PyTorch 1.9. Fastest pass and those statistically indistinguishable are highlighted. All at the p=0.001 level.

# Conclusion

Overall I am quite pleased with the performance of native Mish. On more modern hardware, when training via mixed precision or full precision, native Mish should be comparable to, if not faster than, other native PyTorch activation functions. (Unfortunately, I did not have access to Ampere hardware and thus was unable to benchmark Nvidia’s TensorFloat32; however, I would expect it to behave similarly to float16 and float32.) On older hardware, native Mish is still a performance upgrade over other Mish implementations, but it is no longer in contention for the best overall performance.

The one sore spot is CPU performance where it’s best to replace the native Mish implementation with a TorchScript implementation. Hopefully future releases of PyTorch will improve Mish’s CPU forward pass performance closer to SiLU’s.

With the release of PyTorch 1.9 it is time to switch to native Mish.
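
In practice that just means using the built-in module when defining a model; a minimal sketch (my own example, not a model from the post):

```python
import torch.nn as nn

# Mish ships as nn.Mish (and torch.nn.functional.mish) starting with PyTorch 1.9,
# so it slots in wherever nn.ReLU() would normally go.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.Mish(),            # instead of nn.ReLU()
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
```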

# References

  1. Diganta Misra. 2019. Mish: A self regularized non-monotonic neural activation function. arXiv:1908.08681.
  2. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, 315–323.
  3. Qian Liu and Steve Furber. 2016. Noisy Softplus: A Biology Inspired Activation Function. In Neural Information Processing. DOI:10.1007/978-3-319-46681-1_49
  4. Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2017. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. arXiv:1702.03118.
  5. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (November 1997), 1735–1780. DOI:10.1162/neco.1997.9.8.1735
  6. Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934.
  7. Dan Hendrycks and Kevin Gimpel. 2020. Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
  8. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.