I’ve been following tinygrad for almost two years, since back when it still depended on NumPy. Today, tinygrad is closing in on a $1M AMD contract.
It’s really fun to work with: I can just download 1.8MB of code and start training models.
And we’re almost there: speeds with BEAM=2 are close to matching PyTorch kernels (especially on AMD).
Rewriting the same training loop got kind of boring, so I tried to reimplement the Learner from fastai22p2 in tinygrad:
#... Truncated code

def transforms(s: dict):
    x = "image"
    s[x] = [pil_to_tensor(o, pixel_format="Grayscale").flatten() for o in s[x]]
    return s

BATCH_SIZE = 128

def main():
    ds = load_dataset("zalando-datasets/fashion_mnist")
    tds = ds.with_transform(transforms)
    dls = DataLoaders.from_dd(tds, BATCH_SIZE)

    model = TinyMLP()

    @TinyJit
    def accuracy(preds, y):
        return (preds.argmax(axis=1) == y).mean()

    cbs = [TrainCB(), TqdmCB(), MetricsCB(accuracy=accuracy)]
    learn = Learner(model, dls, loss_func=loss_func, lr=LR, cbs=cbs)
    learn.fit(1)

Looks nice, and it trains.
uv run examples/train_mnist.py
Epoch:0 - Train Loss: 0.699: 100%|███████| 468/468 [00:35<00:00, 13.14it/s]
MetricsCB - accuracy: 0.6882
Epoch:0 - Valid Loss: 0.729: 100%|███████| 78/78 [00:05<00:00, 13.92it/s]
MetricsCB - accuracy: 0.7362

But it’s very slow. tinygrad is supposed to match PyTorch kernels, yet we’re only getting about 13 it/s with 128 images per batch, while PyTorch gets:
Epoch:0 - Train Loss: 0.731: 100%|███████| 469/469 [00:01<00:00, 248.81it/s]

Oof, that’s ~19x faster.
Why is that? Let’s look at the code.
We load our Fashion-MNIST dataset from Hugging Face, and it comes in {'image': PIL.Image, 'label': int} format.
So it’s simple: convert the PIL.Image to NumPy and copy it into a Tensor.
In fact, that’s exactly what PyTorch does:
# handle PIL Image
img = torch.as_tensor(np.array(pic, copy=True))
img = img.view(pic.size[1], pic.size[0], F_pil.get_image_num_channels(pic)) # Note: this F.func just does pic.channels
# put it from HWC to CHW format
img = img.permute((2, 0, 1))
return img

But if we do the same thing in tinygrad:
np_from_pil 28x28x1 n=30000 total=0.1574s avg=0.0052ms
np_from_pil 512x512x3 n=3000 total=2.4441s avg=0.8147ms
pt_from_pil 28x28x1 n=30000 total=0.3200s avg=0.0107ms
pt_from_pil 512x512x3 n=3000 total=2.4826s avg=0.8275ms
tg_from_pil 28x28x1 n=30000 total=4.5150s avg=0.1505ms
tg_from_pil 512x512x3 n=3000 total=3.4306s avg=1.1435ms
Summary (avg ms, speedup vs tinygrad):
28x28x1 pt=0.0107ms tg=0.1527ms pt_speedup_vs_tg=14.29x
  512x512x3 pt=0.8251ms tg=1.1484ms pt_speedup_vs_tg=1.39x

That’s ~14x slower for small images.
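For context, the tinygrad path is the same recipe as the PyTorch snippet above, just ending in a tinygrad Tensor. Something like this sketch (tg_from_pil is just the name I’m using here; the helper in the benchmark may differ slightly):

import numpy as np
from PIL import Image
from tinygrad import Tensor

def tg_from_pil(pic: Image.Image) -> Tensor:
    # PIL -> NumPy -> Tensor, HWC -> CHW (hypothetical sketch)
    arr = np.array(pic, copy=True)
    if arr.ndim == 2:            # grayscale images come back as HxW
        arr = arr[:, :, None]
    return Tensor(arr).permute(2, 0, 1)

Most of the per-call cost at 28x28 looks like fixed Python and Tensor-creation overhead, which would explain why the gap shrinks to 1.39x for 512x512 images.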
The rest of the gap probably comes from a default_collate and DataLoader that are far less optimized than PyTorch’s.
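To illustrate the kind of overhead I mean, a naive collate along these lines (a hypothetical sketch, not the project’s exact code) builds the batch out of one small Tensor per sample:

from tinygrad import Tensor

def naive_collate(samples: list[dict]) -> tuple[Tensor, Tensor]:
    # one tiny Tensor per sample, then a stack: lots of Python work per batch
    xs = [s["image"] for s in samples]  # each already a flattened 784-element Tensor
    ys = [s["label"] for s in samples]
    return Tensor.stack(*xs), Tensor(ys)

At 128 samples per batch, that is a lot of tiny Python-level operations for very little actual data.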
It reminded me that tinygrad still has a lot of room for improvement. George said they’ll focus more on cool stuff such as GPT-2 speedruns once the AMD contract is done.
Oh boy, was I wrong.
I did some exploring with opencode and found that the main bottleneck was Python overhead, so I tried to optimize by doing less work in the Python layer.
Simply stacking np.array into batches and then converting to Tensor is a lot faster.
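This relies on the batch already arriving as stacked NumPy arrays. One way to get there (my assumption about the setup; the project may do it differently) is to put the Hugging Face dataset into NumPy format, so slicing a range returns a dict of arrays instead of a list of PIL images:

from datasets import load_dataset

ds = load_dataset("zalando-datasets/fashion_mnist")
train = ds["train"].with_format("numpy")  # slices now come back as np.ndarrays
batch = train[0:128]                      # {'image': (128, 28, 28), 'label': (128,)} arrays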
def __iter__(self):
    for i in range(len(self)):  # Note: len(self) is the number of batches per epoch
        start = i * self.batch_size
        batch = self.data[start : start + self.batch_size]
        if self.transform:
            batch = self.transform(batch)
        yield batch

def transforms(batch: dict[str, np.ndarray]) -> tuple[Tensor, Tensor]:
    x, y = "image", "label"
    return Tensor(batch[x]).reshape(-1, 28 * 28), Tensor(batch[y])

After running training, we went from ~13 it/s to:
Epoch:0 - Train Loss: 0.816: 100%|████████| 468/468 [00:04<00:00, 100.21it/s]
Epoch:0 - Valid Loss: 0.678: 100%|████████| 78/78 [00:00<00:00, 132.05it/s]

This is within an acceptable margin compared to PyTorch.
Then I ran a slightly modified version (to match the model, batch size, and optimizer) of examples/beautiful_mnist.py from the tinygrad repo.
test_accuracy: 96.43%: 100%|████████| 7000/7000 [00:03<00:00, 2109.64it/s]

But it doesn’t use the DataLoader; it loads data directly from the dataset.
X_train, Y_train, X_test, Y_test = mnist(fashion=getenv("FASHION"))
X_train = X_train.reshape(-1, 28 * 28) # To match my model
X_test = X_test.reshape(-1, 28 * 28)  # To match my model

So the ~20x difference is because the data is loaded directly into memory. Let’s implement that as well.
@TinyJit
def sample(self) -> tuple[Tensor, ...]:
    # This will miss some data samples due to the randomness
    samples = Tensor.randint(self.batch_size, high=self.data_len)
    return tuple(col[samples] for col in self.data)

def __iter__(self) -> Iterator[tuple[Tensor, ...]]:
    for _ in range(len(self)):
        yield self.sample()

We add a flag to the DataLoader: in_memory=True.
This is nice because it lets us use @TinyJit for sampling.
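Usage-wise, that might look something like this (assuming the flag is threaded through from_dd; the exact signature in the project may differ):

dls = DataLoaders.from_dd(tds, BATCH_SIZE, in_memory=True)

With that in place: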
Epoch:0 - Train Loss: 0.682: 100%|██████████| 468/468 [00:01<00:00, 346.74it/s]
Epoch:0 - Valid Loss: 0.538: 100%|██████████| 78/78 [00:00<00:00, 629.32it/s]
Epoch:1 - Train Loss: 0.527: 100%|██████████| 468/468 [00:00<00:00, 2714.51it/s]
Epoch:1 - Valid Loss: 0.690: 100%|██████████| 78/78 [00:00<00:00, 2731.15it/s]

Now it’s fast and still using only one CPU (epoch 0 is slower, presumably because TinyJit is still capturing and compiling the kernels).
The issue with in_memory sampling is that it will miss some datapoints: Tensor.randint draws indices with replacement, so within an epoch some samples appear more than once and others not at all. That’s what I’m currently working on.
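One possible direction (just a sketch of the idea, not what the project currently does) is to draw a fresh permutation per epoch and slice batches out of it, so every sample appears exactly once:

import numpy as np
from tinygrad import Tensor

def epoch_batches(data: tuple[Tensor, ...], batch_size: int):
    # new permutation each epoch: no repeats, no missed samples
    perm = np.random.permutation(data[0].shape[0]).astype(np.int32)
    for start in range(0, len(perm), batch_size):
        idx = Tensor(perm[start : start + batch_size])
        yield tuple(col[idx] for col in data)

The trade-off is a bit more Python work per batch, and it needs some care (fixed batch shapes) to stay compatible with @TinyJit.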
You can see the project here.