Testing Amazon SageMaker Studio Lab

Comparing SageMaker to Google Colab and Kaggle

Benjamin Warner

December 8, 2021

Updated: Jul 20, 2022

A week ago, on December 1st, Amazon soft-launched SageMaker Studio Lab, a free and simplified version of SageMaker Studio which does not require a credit card or an AWS or Amazon account. (At the time of publishing, there is a waitlist to get an account. Amazon states it should take one to five business days to get approval. My application was approved the same day.) SageMaker Studio Lab provides a CPU instance with a time limit of twelve hours and a GPU instance with a time limit of four hours. It joins Google Colab, Kaggle, and Paperspace in the free machine learning and deep learning compute space.

The obvious question becomes: how does SageMaker Studio Lab stack up against the competition? And should you start using it?

In this post I will compare and benchmark training neural networks on SageMaker Studio Lab against Google Colab, Colab Pro, and Kaggle using image and NLP classification tasks. (I have not used Colab Pro+ or Paperspace Gradient, so I will not include them in this comparison.)

# Comparison with Colab and Kaggle

Like Google Colab and Kaggle, Studio Lab offers both CPU and GPU instances: a T3.xlarge CPU instance with a 12 hour runtime and a G4dn.xlarge GPU instance with a 4 hour runtime. I will limit this comparison to the GPU instances provided by all three services.

GPU Instance Comparison Overview

Service	Instance	CPU	CPU Generation	RAM	GPU	GPU RAM	Max Scratch Disk	Persistent Storage	Max Runtime	Cost
Studio Lab	-	4 CPUs	Cascade Lake	16GB	Tesla T4	15109MiB	-	15GB	4 hours	Free
Colab	-	2 CPUs	Varies	13GB	Tesla K80	11441MiB	60GB	15GB	12 hours	Free
Colab Pro	Normal	2 CPUs	Varies	13GB	Tesla P100	16280MiB	124GB	15GB	24 hours	$10/m
Colab Pro	High RAM	4 CPUs	Varies	26GB	Tesla P100	16280MiB	124GB	15GB	24 hours	$10/m
Kaggle	-	2 CPUs	Varies	13GB	Tesla P100	16280MiB	90GB	20GB	9 hours	Free

See below for clarifications.

Some expansion on the table:

- SageMaker Studio Lab only has persistent storage, but unlike Google Drive, it is fast enough to train from.
- I’ve observed the following CPU generations on Colab and Kaggle: Haswell, Broadwell, Skylake, and Cascade Lake. Anecdotally, most seem to be Haswell or Broadwell.
- Colab’s scratch disk size varies per instance; the table reports the maximum sizes I have recently observed.
- Colab’s persistent storage is Google Drive’s free allocation.
- Colab Pro could assign a Tesla T4 or Tesla K80. I’ve consistently been assigned Tesla P100s since Tesla V100s were relegated to Pro+. I’ve only been assigned a K80 once. ~~A T4 has not been assigned to me in months.~~ The day after publishing this post Google assigned me the first T4 in months, but between then and the update, they’ve all been P100s.
- The free version of Colab could also assign a Tesla T4 or Tesla P100, but in my very limited recent usage of it I’ve only been assigned K80s.
- The Colab FAQ and Colab Pro/Pro+ FAQ say different things about maximum runtime. I’ve observed the max runtime on the Pro FAQ to be accurate, but others have not.
- Colab Pro can have a pop-up to verify you are at the computer, but once checked it is not shown again.
- Kaggle’s persistent storage is 20GB per notebook. There is also 100GB per account for private dataset storage.
- Kaggle has a maximum weekly GPU runtime which varies based on total usage, but is around 40 hours per week, plus or minus a few hours.
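
For readers who want to check what a given session was actually assigned, a minimal sketch along these lines collects comparable numbers. It assumes `nvidia-smi` is on the PATH of a GPU instance, and `instance_summary` is a hypothetical helper name. This is not necessarily how the table above was produced.

```python
import os
import shutil
import subprocess


def instance_summary(path="."):
    """Collect CPU count, disk size, and GPU name/memory for the current session."""
    total, _used, free = shutil.disk_usage(path)
    summary = {
        "cpu_count": os.cpu_count(),
        "disk_total_gb": round(total / 1e9, 1),
        "disk_free_gb": round(free / 1e9, 1),
        "gpus": [],
    }
    # nvidia-smi only exists on GPU instances, so treat it as optional.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            summary["gpus"] = [
                line.strip() for line in result.stdout.splitlines() if line.strip()
            ]
    return summary


print(instance_summary())
```

On a CPU-only instance the `gpus` list simply stays empty.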

# SageMaker Studio Lab

Launching SageMaker Studio Lab results in a lightly modified JupyterLab instance, accented with Studio Lab purple, with a few extensions such as Git installed.

The Studio Lab JupyterLab environment. Some discoloration due to image processing.

In my limited testing, SageMaker Studio Lab’s JupyterLab behaves exactly as a normal installation of JupyterLab does on your own system. It even appears that modifications to JupyterLab and installed Python packages persist. (One nice feature of persistence is that you can install packages and download or pre-process data on a CPU instance, should Amazon demote you to the bottom of a queue due to GPU usage in the future. Currently there appears to be no queue.)

For example, I was able to install a Python language server and markdown spellchecker from this JupyterLab Awesome List, although I did not attempt to install any extensions which require NodeJS. (I’ve heard that installing NodeJS via conda doesn’t work for Jupyter extensions.) This persistence of packages does bring up the question of whether Amazon will update pre-installed packages like PyTorch, or if maintaining an updated environment falls completely on the user. (Anecdotally, I’ve heard of one instance of a bricked environment where the only solution was to delete the Studio Lab account and reapply for a new one.)

It is possible that Amazon delayed destroying my instance or will upgrade the underlying image sometime in the future, removing the custom installed packages and extensions. I will update this post should that occur. But for now, Studio Lab is the most customizable service of the three.

I installed Python packages this way and from the JupyterLab terminal without issue, although from the FAQ I am not sure whether the JupyterLab terminal uses the correct installation method.
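
One generic way to sidestep that uncertainty, on any service, is to call pip through the interpreter the notebook kernel is running, so the package lands in the kernel's own environment. The sketch below is plain Python and not the method from the Studio Lab FAQ. `install_into_kernel_env` is a hypothetical helper name, and the package name is only an example.

```python
import subprocess
import sys


def pip_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def install_into_kernel_env(package, dry_run=True):
    """Install into the kernel's environment; with dry_run, only return the command."""
    cmd = pip_command(package)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Example package name only; the post does not say which language server was installed.
print(install_into_kernel_env("python-lsp-server"))
```

Passing `dry_run=False` performs the installation.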

# Benchmarking

Code and raw results are available here. As Kaggle’s GPU instance and Colab Pro’s GPU instance with 2 CPUs are almost, if not exactly, the same set of machines from GCP, I have elected to only run the benchmark on Colab Pro’s 2 CPU instance and will let those results stand in for Kaggle.

# Datasets and Models

I selected two small datasets for benchmarking SageMaker Studio Lab against Colab: Imagenette for computer vision and IMDB from Hugging Face for NLP. To reduce training times, I randomly sampled twenty percent of the IMDB training and test sets.

For computer vision, I selected XResNet and XSE-ResNet, fastai versions of ResNet (He et al., 2016) with the architectural improvements from the Bag of Tricks paper (He et al., 2019) and, for the latter, squeeze and excitation (Hu et al., 2020).

For NLP, I used the Hugging Face implementation of RoBERTa (Liu et al., 2019).
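
As a rough illustration of the subsampling step, the sketch below draws a reproducible twenty percent of the row indices. The seed values and the `sample_indices` helper are assumptions, not taken from the benchmark code. With the Hugging Face `datasets` library, indices like these could be applied with `Dataset.select(indices)`.

```python
import random


def sample_indices(n_items, fraction=0.2, seed=42):
    """Return a sorted random subset of row indices covering `fraction` of the data."""
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    k = round(n_items * fraction)
    return sorted(rng.sample(range(n_items), k))


# IMDB ships 25,000 training and 25,000 test reviews.
train_idx = sample_indices(25_000)
test_idx = sample_indices(25_000, seed=43)
print(len(train_idx), len(test_idx))  # prints: 5000 5000
```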

# Training Setup

I trained Imagenette using fast.ai with random resized crop and random horizontal flip from fast.ai’s augmentations.

To train IMDB, I used fast.ai and Hugging Face Transformers via blurr. In addition to adding Transformers training and inference support to fast.ai, blurr integrates per batch tokenization and fast.ai’s text dataloader which randomly sorts the dataset based on sequence length to minimize padding while training.
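
To see why grouping by sequence length helps, the sketch below counts pad tokens when each batch is padded to its longest sequence. It uses synthetic lengths and a plain sort. fast.ai's dataloader also adds randomness to the ordering, which this simplified version leaves out.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


rng = random.Random(0)
# Synthetic token counts standing in for tokenized reviews (not real IMDB data).
lengths = [rng.randint(20, 512) for _ in range(5_000)]

shuffled_padding = padding_tokens(lengths, batch_size=16)
sorted_padding = padding_tokens(sorted(lengths), batch_size=16)
print(f"shuffled: {shuffled_padding} pad tokens, length-sorted: {sorted_padding}")
```

Fewer pad tokens means less wasted computation per batch, which matters most for transformer models like RoBERTa.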

I trained XSE-ResNet50 and RoBERTa base using both single and mixed precision. XSE-ResNet50 was trained at an image size of 224 pixels with a batch size of 64 for mixed and 32 for single precision. RoBERTa was trained at a batch size of 16 for mixed precision and 8 for single precision.

To simulate a CPU bound task, I also trained an XResNet18 model at an image size of 128 pixels and batch size of 64.

# Results

As one would anticipate from the GPUs’ technical specs, the Tesla T4 on SageMaker Studio Lab outperforms the Tesla P100 on Google Colab when training in mixed precision, but lags when training full single precision models. I will primarily compare the Colab Pro High RAM P100 instance (shortened to Colab Pro P100; likewise, I will shorten Colab Pro High RAM Tesla T4 to Colab Pro T4) with SageMaker Studio Lab, as they are the most similar sans GPU.

The Colab Pro High RAM Tesla T4 instance performs almost identically to SageMaker Studio Lab in the first benchmark, which is no surprise as they both have a Tesla T4. But in later benchmarks Studio Lab is statistically faster than Colab Pro T4 in some GPU actions, despite both using the same GPU. Perhaps this is just luck of the draw.

Either way, Colab Pro T4 remains significantly faster than Colab Pro P100, so it is unfortunate that Google seems to rarely assign the faster mixed precision GPU and prefers Tesla P100 instances. (The T4 is also cheaper: a Tesla T4 costs $0.35/hr per GPU on [GCP](https://cloud.google.com/compute/gpus-pricing) while the P100 costs $1.46/hr per GPU. Prices can vary per GCP region.)

# XSE-ResNet50

Comparing Colab Pro P100 with SageMaker Studio Lab, XSE-ResNet50 trains 17.4 percent faster overall on Studio Lab. When looking at just the training loop, which is the batch plus draw actions, Studio Lab is 19.6 percent faster than Colab Pro P100. Studio Lab is faster in all actions with one notable exception: the backward pass, where Studio Lab is 10.4 percent slower than Colab Pro P100.

XSE-ResNet50 Mixed Precision Imagenette Simple Profiler Results

		Mean Duration				Duration Std Dev				Total Time
Phase	Action	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab
fit	fit									689.1 s	673.5 s	559.8 s	556.4 s
	epoch	172.2 s	168.3 s	139.9 s	139.0 s	3.227 s	2.418 s	3.826 s	1.708 s	688.7 s	673.1 s	559.5 s	556.1 s
	train	147.5 s	149.3 s	123.6 s	124.3 s	2.045 s	1.972 s	3.145 s	1.390 s	590.2 s	597.4 s	494.3 s	497.4 s
	validate	24.62 s	18.93 s	16.30 s	14.66 s	1.198 s	446.4ms	682.8ms	386.6ms	98.48 s	75.72 s	65.18 s	58.64 s
train	batch	990.2ms	1.006 s	805.1ms	809.8ms	217.5ms	231.6ms	355.4ms	307.8ms	582.3 s	591.6 s	473.4 s	476.1 s
	step	612.3ms	627.8ms	460.9ms	423.3ms	23.88ms	23.83ms	19.84ms	18.55ms	360.1 s	369.1 s	271.0 s	248.9 s
	backward	309.1ms	307.3ms	279.8ms	339.3ms	129.6ms	130.2ms	207.7ms	188.5ms	181.7 s	180.7 s	164.5 s	199.5 s
	pred	50.30ms	52.64ms	46.47ms	33.54ms	80.19ms	82.73ms	124.9ms	106.8ms	29.57 s	30.95 s	27.33 s	19.72 s
	draw	14.86ms	14.64ms	14.42ms	10.70ms	52.75ms	71.08ms	69.03ms	49.37ms	8.738 s	8.608 s	8.482 s	6.290 s
	zero grad	2.087ms	2.484ms	2.324ms	1.899ms	330.6µs	351.4µs	476.0µs	72.35µs	1.227 s	1.461 s	1.367 s	1.117 s
	loss	1.412ms	1.191ms	1.077ms	998.7µs	156.5µs	127.6µs	296.7µs	173.5µs	830.2ms	700.2ms	633.6ms	587.2ms
valid	batch	184.8ms	82.08ms	75.39ms	54.24ms	142.3ms	138.7ms	148.9ms	119.6ms	45.83 s	20.36 s	18.70 s	13.45 s
	pred	111.0ms	55.52ms	49.08ms	35.02ms	122.1ms	66.76ms	90.63ms	84.52ms	27.53 s	13.77 s	12.17 s	8.686 s
	draw	71.74ms	25.09ms	25.01ms	18.04ms	71.54ms	116.1ms	114.2ms	82.44ms	17.79 s	6.221 s	6.201 s	4.475 s
	loss	1.642ms	1.253ms	1.130ms	990.5µs	1.967ms	411.8µs	795.7µs	1.323ms	407.1ms	310.8ms	280.2ms	245.7ms

Kaggle results should be equivalent to Colab Pro, which is the normal P100 instance. Pro P100 is the Colab Pro High RAM P100 instance. Pro T4 is the Colab Pro High RAM T4 instance. Results measured with Simple Profiler Callback.
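
The mixed precision percentages quoted for this table are consistent with the convention sketched below: "faster" is time saved relative to the Colab Pro P100 time, and "slower" is extra time relative to it. This convention is inferred from the table values rather than stated in the benchmark code, and the helper names are hypothetical.

```python
def percent_faster(baseline_s, candidate_s):
    """Time saved by the candidate as a percentage of the baseline's time."""
    return (baseline_s - candidate_s) / baseline_s * 100


def percent_slower(baseline_s, candidate_s):
    """Extra time taken by the candidate as a percentage of the baseline's time."""
    return (candidate_s - baseline_s) / baseline_s * 100


# Total Time values in seconds from the mixed precision table above.
p100 = {"fit": 673.5, "batch": 591.6, "draw": 8.608}
studio_lab = {"fit": 556.4, "batch": 476.1, "draw": 6.290}

overall = percent_faster(p100["fit"], studio_lab["fit"])
loop = percent_faster(
    p100["batch"] + p100["draw"],
    studio_lab["batch"] + studio_lab["draw"],
)
# Mean backward pass durations in milliseconds (Pro P100, Studio Lab).
backward = percent_slower(307.3, 339.3)

print(f"{overall:.1f} {loop:.1f} {backward:.1f}")  # prints: 17.4 19.6 10.4
```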

When training XSE-ResNet50 in single precision, the results flip, with Studio Lab performing 95.9 percent slower than Colab Pro P100. The training loop is 93.8 percent slower than Colab Pro P100, although this is solely due to the backward pass and optimizer step, where Studio Lab is 105 percent slower than Colab Pro P100, while it is 41.1 percent faster during all other actions.

XSE-ResNet50 Single Precision Imagenette Simple Profiler Results

		Mean Duration				Duration Std Dev				Total Time
Phase	Action	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab
fit	fit									663.4 s	645.7 s	1.391ks	1.265ks
	epoch	165.8 s	161.3 s	347.6 s	316.3 s	2.120 s	2.188 s	3.918 s	9.994 s	663.1 s	645.3 s	1.390ks	1.265ks
	train	142.7 s	142.7 s	309.1 s	282.7 s	1.657 s	1.710 s	3.423 s	8.503 s	570.8 s	571.0 s	1.236ks	1.131ks
	validate	23.07 s	18.57 s	38.49 s	33.61 s	465.2ms	477.5ms	497.5ms	1.580 s	92.30 s	74.29 s	153.9 s	134.4 s
train	batch	953.3ms	961.1ms	2.061 s	1.881 s	190.7ms	205.3ms	293.3ms	291.1ms	560.5 s	565.1 s	1.212ks	1.106ks
	step	626.6ms	634.0ms	1.437 s	1.222 s	24.29ms	24.00ms	59.45ms	68.18ms	368.4 s	372.8 s	844.9 s	718.3 s
	backward	262.4ms	260.3ms	556.0ms	615.1ms	116.9ms	117.3ms	190.2ms	188.3ms	154.3 s	153.1 s	326.9 s	361.7 s
	pred	45.87ms	48.43ms	50.19ms	30.54ms	66.68ms	68.29ms	120.5ms	116.9ms	26.97 s	28.48 s	29.51 s	17.96 s
	draw	14.84ms	14.58ms	14.41ms	10.75ms	51.51ms	70.71ms	69.64ms	49.31ms	8.727 s	8.573 s	8.471 s	6.324 s
	zero grad	2.153ms	2.519ms	2.418ms	2.042ms	362.9µs	375.2µs	478.3µs	74.03µs	1.266 s	1.481 s	1.422 s	1.200 s
	loss	1.298ms	1.109ms	979.8µs	914.1µs	343.9µs	127.9µs	123.6µs	156.4µs	763.1ms	652.2ms	576.1ms	537.5ms
valid	batch	162.4ms	80.50ms	68.12ms	44.00ms	130.5ms	146.1ms	147.7ms	130.0ms	40.26 s	19.96 s	16.89 s	10.91 s
	pred	92.53ms	53.83ms	43.95ms	26.26ms	107.5ms	74.66ms	79.61ms	95.61ms	22.95 s	13.35 s	10.90 s	6.513 s
	draw	67.73ms	25.21ms	23.35ms	17.12ms	73.53ms	117.0ms	116.4ms	83.93ms	16.80 s	6.251 s	5.791 s	4.245 s
	loss	1.758ms	1.231ms	702.6µs	540.3µs	2.174ms	1.460ms	318.8µs	198.8µs	435.9ms	305.2ms	174.2ms	134.0ms

# RoBERTa

Training RoBERTa in mixed precision, Studio Lab pulls further ahead of Colab Pro P100, performing 29.1 percent faster. Studio Lab is 32.1 percent faster than Colab Pro P100 during the training loop and is faster in all actions except for zeroing the gradients (the zero grad action), where Studio Lab is 66.7 percent slower.

RoBERTa Mixed Precision Benchmark Results

		Mean Duration				Duration Std Dev				Total Time
Phase	Action	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab
fit	fit									828.9 s	842.3 s	647.7 s	596.8 s
	epoch	207.2 s	210.6 s	161.9 s	149.2 s	151.0ms	62.80ms	2.517 s	458.8ms	828.9 s	842.2 s	647.7 s	596.7 s
	train	151.7 s	154.9 s	119.7 s	110.7 s	131.4ms	58.04ms	2.443 s	329.2ms	606.9 s	619.6 s	478.8 s	442.9 s
	validate	55.49 s	55.66 s	42.22 s	38.45 s	35.57ms	18.38ms	76.20ms	508.5ms	222.0 s	222.6 s	168.9 s	153.8 s
train	batch	477.3ms	491.8ms	362.0ms	334.6ms	249.9ms	252.0ms	203.6ms	181.1ms	595.6 s	613.8 s	451.8 s	417.6 s
	step	368.4ms	368.2ms	265.0ms	248.8ms	298.4ms	298.0ms	227.5ms	205.1ms	459.8 s	459.5 s	330.7 s	310.5 s
	backward	75.53ms	88.24ms	63.36ms	60.07ms	70.60ms	70.00ms	43.12ms	38.08ms	94.27 s	110.1 s	79.07 s	74.97 s
	pred	23.59ms	25.03ms	22.55ms	16.65ms	2.843ms	4.153ms	5.166ms	3.348ms	29.44 s	31.24 s	28.14 s	20.78 s
	draw	6.212ms	6.773ms	5.802ms	4.176ms	12.92ms	23.15ms	25.29ms	122.2µs	7.752 s	8.453 s	7.241 s	5.212 s
	zero grad	1.813ms	2.169ms	3.912ms	3.615ms	377.3µs	370.1µs	390.6µs	15.83ms	2.263 s	2.706 s	4.883 s	4.512 s
	loss	1.519ms	1.260ms	1.282ms	1.222ms	264.3µs	161.5µs	196.1µs	217.2µs	1.896 s	1.573 s	1.600 s	1.525 s
valid	batch	26.10ms	27.77ms	23.18ms	16.62ms	15.27ms	26.57ms	28.78ms	19.19ms	32.68 s	34.76 s	29.02 s	20.81 s
	pred	18.92ms	20.31ms	16.89ms	12.59ms	2.346ms	4.193ms	4.098ms	2.944ms	23.68 s	25.43 s	21.15 s	15.77 s
	draw	5.900ms	6.347ms	5.292ms	3.156ms	13.46ms	24.95ms	26.90ms	17.10ms	7.387 s	7.946 s	6.625 s	3.951 s
	loss	1.062ms	927.2µs	841.1µs	754.8µs	516.1µs	216.2µs	195.1µs	915.0µs	1.330 s	1.161 s	1.053 s	945.1ms

In single precision, the results again flip, with Studio Lab training 72.2 percent slower overall than Colab Pro P100. The training loop is 67.9 percent slower than Colab Pro P100. As when training XSE-ResNet50 in single precision, this is due to the backward pass and optimizer step being 83.0 percent slower, while Studio Lab is 27.7 percent faster performing all other actions.

RoBERTa Single Precision Benchmark Results

		Mean Duration				Duration Std Dev				Total Time
Phase	Action	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab
fit	fit									877.0 s	885.7 s	1.602ks	1.525ks
	epoch	219.2 s	221.4 s	400.6 s	381.1 s	122.3ms	124.9ms	991.9ms	13.67 s	877.0 s	885.6 s	1.602ks	1.525ks
	train	163.8 s	165.7 s	300.8 s	287.9 s	94.80ms	95.66ms	347.8ms	10.89 s	655.0 s	662.7 s	1.203ks	1.152ks
	validate	55.48 s	55.72 s	99.77 s	93.20 s	41.20ms	42.00ms	660.9ms	2.794 s	221.9 s	222.9 s	399.1 s	372.8 s
train	batch	250.3ms	258.7ms	459.6ms	440.2ms	123.9ms	127.5ms	244.5ms	232.9ms	625.9 s	646.7 s	1.149ks	1.101ks
	step	114.7ms	116.2ms	244.0ms	236.5ms	78.62ms	75.19ms	308.3ms	295.3ms	286.9 s	290.4 s	610.0 s	591.3 s
	backward	108.9ms	112.4ms	186.6ms	181.6ms	163.2ms	160.5ms	120.7ms	116.5ms	272.2 s	280.9 s	466.6 s	453.9 s
	pred	17.98ms	20.64ms	18.62ms	13.60ms	2.377ms	3.465ms	3.679ms	2.383ms	44.96 s	51.59 s	46.55 s	33.99 s
	draw	5.693ms	6.211ms	5.038ms	4.246ms	8.306ms	14.41ms	17.33ms	95.83µs	14.23 s	15.53 s	12.59 s	10.62 s
	zero grad	1.830ms	2.135ms	4.091ms	3.104ms	350.8µs	341.4µs	300.1µs	9.663ms	4.575 s	5.338 s	10.23 s	7.760 s
	loss	1.056ms	1.035ms	1.094ms	1.090ms	297.1µs	148.4µs	155.3µs	156.7µs	2.640 s	2.588 s	2.735 s	2.725 s
valid	batch	20.41ms	22.57ms	19.22ms	13.12ms	9.128ms	16.53ms	18.10ms	11.33ms	51.03 s	56.42 s	48.06 s	32.80 s
	pred	14.11ms	16.40ms	13.94ms	9.745ms	1.738ms	3.197ms	3.018ms	2.036ms	35.26 s	41.01 s	34.84 s	24.36 s
	draw	5.390ms	5.254ms	4.466ms	2.710ms	8.345ms	15.43ms	17.35ms	10.43ms	13.47 s	13.13 s	11.16 s	6.774 s
	loss	751.9µs	760.2µs	693.6µs	564.0µs	203.2µs	250.4µs	157.6µs	235.0µs	1.880 s	1.900 s	1.734 s	1.410 s

Oddly, the Colab Pro High RAM P100 instance trained slower than the normal Colab Pro instance, despite more CPU cores and CPU RAM and the same GPU. However, it was not a large difference and probably not significant or repeatable.

# XResNet18

For this benchmark, it’s important to know what the draw action is measuring. It’s the time from before drawing a batch from the dataloader to before starting the batch action (which includes the forward and backward passes, the loss, and the optimizer step and zero grad actions). The dataloader is set to the default prefetch_factor value of two, which means each worker attempts to load two batches in advance before the training loop calls for them.

The lower the draw action is, the better the instance’s CPU is able to keep up with demand.
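
The idea behind the draw measurement can be sketched with a plain Python loop. This is a simplified stand-in for the Simple Profiler Callback, not its implementation: the loader here is synchronous with no prefetching workers, and `profile_loop` and `slow_loader` are hypothetical names.

```python
import time


def profile_loop(batches, train_step):
    """Time each draw (waiting on the loader) and each batch (compute) separately."""
    draw_times, batch_times = [], []
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # draw: blocks until the loader yields a batch
        except StopIteration:
            break
        drawn = time.perf_counter()
        train_step(batch)  # batch: forward, loss, backward, step, zero grad
        done = time.perf_counter()
        draw_times.append(drawn - start)
        batch_times.append(done - drawn)
    return draw_times, batch_times


def slow_loader(n_batches, delay_s):
    """Stand-in for a CPU-bound dataloader that needs delay_s to produce a batch."""
    for i in range(n_batches):
        time.sleep(delay_s)
        yield i


draws, steps = profile_loop(slow_loader(5, 0.02), train_step=lambda batch: None)
print(f"mean draw: {sum(draws) / len(draws) * 1e3:.1f} ms")
```

When the CPU keeps up, prefetched batches are already waiting and the draw time stays near zero. A slow loader shows up directly as a long draw.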

Here the results are as expected: more CPU cores mean a lower draw time, and newer CPUs outperform older CPUs at the same core count during validation. However, I should note that, excluding the two CPU instance, the only statistically significant validation draw difference (according to the t-test) was between Colab Pro P100 and Studio Lab, which makes sense as Colab CPU generations vary and so far Studio Lab’s has been consistent.

XResNet18 Benchmark Results

		Mean Duration				Duration Std Dev
Phase	Action	Colab Pro	Pro P100	Pro T4	Studio Lab	Colab Pro	Pro P100	Pro T4	Studio Lab
fit	epoch	51.13 s	32.42 s	28.58 s	24.27 s	170.1ms	146.2ms	403.5ms	266.9ms
	train	35.69 s	22.65 s	20.01 s	16.91 s	52.29ms	148.3ms	210.7ms	216.9ms
	validate	15.43 s	9.766 s	8.566 s	7.366 s	129.2ms	60.54ms	206.7ms	59.39ms
train	batch	230.6ms	140.8ms	124.0ms	102.7ms	144.8ms	100.2ms	102.3ms	89.08ms
	draw	146.8ms	35.22ms	34.64ms	34.08ms	135.8ms	78.72ms	76.97ms	66.18ms
valid	batch	238.7ms	147.2ms	125.6ms	105.6ms	165.5ms	184.2ms	160.5ms	150.6ms
	draw	215.2ms	117.9ms	101.3ms	90.18ms	165.0ms	182.9ms	158.6ms	151.5ms
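
For reference, a t-test on two sets of per-batch draw times can be sketched as below. The post does not say which variant was used, so Welch's unequal-variance form is an assumption here, and the sample values are made up purely for illustration.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up draw times in milliseconds, for illustration only.
instance_a = [110.0, 125.0, 118.0, 131.0, 122.0, 115.0]
instance_b = [92.0, 88.0, 95.0, 90.0, 97.0, 85.0]
t, df = welch_t(instance_a, instance_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The statistic and degrees of freedom would then be compared against a t distribution to obtain a p-value, for example with a statistics package.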

# Colab Tesla K80

Since the free Colab instance’s Tesla K80 has one-fourth less RAM than all the other GPUs, I also reduced the mixed precision batch size by one-fourth, to 48 and 12 for Imagenette and IMDB, respectively. This isn’t a direct comparison in performance, but rather a real-world comparison of what users would see. I did not run any single precision tests.

I ran the Imagenette benchmark for two epochs, reduced the IMDB dataset from a twenty percent sample to a ten percent sample, and reduced the IMDB training length to one epoch.

The Colab K80 took roughly twice as long as the Colab Pro instances to train for half the number of Imagenette epochs. An equivalent IMDB training run would have taken over three times longer on the Colab K80 versus the Colab P100.

If possible, one should stay away from training using a K80 on anything other than small models.

XResNet & RoBERTa Colab K80 Benchmark Results

XResNet50		Duration			RoBERTa		Duration
Phase	Action	Mean	Std Dev	Total Time	Phase	Action	Mean	Std Dev	Total Time
fit	fit			1.330ks	fit	fit			626.3 s
	epoch	664.9 s	9.262 s	1.330ks		epoch	626.3 s		626.3 s
	train	593.1 s	7.196 s	1.186ks		train	465.6 s		465.6 s
	validate	71.71 s	2.065 s	143.4 s		validate	160.6 s		160.6 s
train	batch	2.959 s	698.7ms	1.166ks	train	batch	2.201 s	1.030 s	457.8 s
	step	1.685 s	80.89ms	663.9 s		backward	1.503 s	793.6ms	312.5 s
	backward	1.155 s	487.7ms	455.0 s		pred	554.2ms	271.0ms	115.3 s
	pred	99.67ms	257.7ms	39.27 s		step	115.7ms	57.66ms	24.06 s
	draw	14.02ms	44.24ms	5.525 s		loss	11.01ms	5.196ms	2.290 s
	zero_grad	2.799ms	719.0µs	1.103 s		draw	10.51ms	17.78ms	2.187 s
	loss	2.143ms	697.3µs	844.2ms		zero_grad	6.685ms	1.352ms	1.391 s
valid	batch	127.8ms	386.6ms	20.95 s	valid	batch	551.0ms	249.1ms	115.2 s
	pred	105.7ms	378.7ms	17.33 s		pred	539.2ms	245.4ms	112.7 s
	draw	19.91ms	71.68ms	3.265 s		draw	8.380ms	16.78ms	1.751 s
	loss	1.870ms	1.307ms	306.7ms		loss	3.218ms	1.468ms	672.5ms

Results measured with Simple Profiler Callback.
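
Normalizing by epoch makes the Imagenette gap concrete. The short calculation below uses the mean epoch durations from the mixed precision tables in this post. It is not a like-for-like comparison, since the K80 run used a batch size of 48 rather than 64.

```python
# Mean epoch durations in seconds, taken from the mixed precision tables in this post.
k80_epoch_s = 664.9
colab_pro_epoch_s = {"Colab Pro": 172.2, "Pro P100": 168.3, "Pro T4": 139.9}

# How many times longer one K80 epoch takes than one epoch on each instance.
ratios = {name: k80_epoch_s / epoch_s for name, epoch_s in colab_pro_epoch_s.items()}

for name, ratio in ratios.items():
    print(f"K80 epoch takes {ratio:.1f}x the time of {name}")
```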

# Final Thoughts

Overall, I think SageMaker Studio Lab is a good competitor in the free machine learning compute space. For anyone who has been using the free tier of Colab and training models on K80s, it’s almost a straight upgrade across the board.

For those not stuck using the free tier of Colab, SageMaker Studio Lab could be a useful addition to machine learning workflows as an augmentation to Kaggle or Colab Pro. The 17.4 to 32.1 percent faster training in mixed precision than Kaggle or Colab Pro means less time waiting for models to train while iterating on an idea. (If Colab Pro would assign Tesla T4s more often, then the mixed precision speed advantage SageMaker Studio Lab has over Colab Pro would start to evaporate.) Then, once iteration has ended, move training to Kaggle or Colab Pro for the longer runtime. All of this assumes the dataset can fit into the 15GB of storage.

SageMaker Studio Lab is also a strong contender for those just starting out with deep learning due to both the faster training speed and persistent storage, which means the environment needs only be set up once, allowing students to focus on learning and not continual package management. (SageMaker Studio Lab guides will need to explicitly and repeatedly mention the necessity of training in mixed precision.)

# References

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. 2019. Bag of Tricks for Image Classification with Convolutional Neural Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 558–567. DOI:10.1109/CVPR.2019.00065
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. 2020. Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 8 (2020), 2011–2023. DOI:10.1109/TPAMI.2019.2913372
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
