I’ve been following tinygrad as a framework for almost two years, since back when it still depended on NumPy. Today, tinygrad is closing in on a $1M AMD contract.

It’s really fun to work with: I can just download 1.8MB of code and start training models.

And it’s almost there: with BEAM=2, tinygrad’s kernels come close to matching PyTorch’s (especially on AMD).
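For context, BEAM is a tinygrad environment variable that enables a beam search over kernel optimizations at compile time; you turn it on from the shell, e.g. (train.py here is just a placeholder script name):

BEAM=2 python train.py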

Rewriting the same training loop got kind of boring, so I tried to reimplement the Learner from fastai’s course22p2 in tinygrad:

#... Truncated code

def transforms(s: dict) -> dict:
    # `s` is a batch dict of {'image': [PIL.Image, ...], 'label': [...]}
    x = "image"
    s[x] = [pil_to_tensor(o, pixel_format="Grayscale").flatten() for o in s[x]]
    return s

BATCH_SIZE = 128
def main():
    ds = load_dataset("zalando-datasets/fashion_mnist")
    tds = ds.with_transform(transforms)

    dls = DataLoaders.from_dd(tds, BATCH_SIZE)
    model = TinyMLP()

    @TinyJit
    def accuracy(preds, y):
        return (preds.argmax(axis=1) == y).mean()

    cbs = [TrainCB(), TqdmCB(), MetricsCB(accuracy=accuracy)]
    learn = Learner(model, dls, loss_func=loss_func, lr=LR, cbs=cbs)
    learn.fit(1)

Looks nice, and it trains.

uv run examples/train_mnist.py

Epoch:0 - Train Loss: 0.699: 100%|███████| 468/468 [00:35<00:00, 13.14it/s]
MetricsCB - accuracy: 0.6882
Epoch:0 - Valid Loss: 0.729: 100%|███████| 78/78 [00:05<00:00, 13.92it/s]
MetricsCB - accuracy: 0.7362

But it’s very slow. The kernels are supposed to match PyTorch, yet we’re only getting ~13 it/s at 128 images per batch, while PyTorch gets:

Epoch:0 - Train Loss: 0.731: 100%|███████| 469/469 [00:01<00:00, 248.81it/s]

Oof, that’s 19x faster. Why is that? Let’s look at the code. We load our Fashion-MNIST dataset from Hugging Face, and it comes in {'image': PIL.Image, 'label': int} format.
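Peeking at a raw sample (the exact repr and label value shown are just an example):

from datasets import load_dataset

ds = load_dataset("zalando-datasets/fashion_mnist")
print(ds["train"][0])  # e.g. {'image': <PIL.Image.Image image mode=L size=28x28>, 'label': 9}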

So the conversion is simple: turn the PIL.Image into a NumPy array and copy it into a Tensor. In fact, that’s exactly what torchvision does:

# handle PIL Image
img = torch.as_tensor(np.array(pic, copy=True))
img = img.view(pic.size[1], pic.size[0], F_pil.get_image_num_channels(pic)) # Note: this helper just returns the number of channels
# put it from HWC to CHW format
img = img.permute((2, 0, 1))
return img

But doing the same thing in tinygrad is noticeably slower.
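Here’s roughly what that conversion looks like (a minimal sketch; tg_from_pil is just the name from my benchmark, not a tinygrad API):

import numpy as np
from tinygrad import Tensor

def tg_from_pil(pic):
    img = Tensor(np.array(pic, copy=True))           # PIL -> NumPy -> tinygrad Tensor
    img = img.reshape(pic.size[1], pic.size[0], -1)  # HWC layout
    return img.permute(2, 0, 1)                      # HWC -> CHW, like torchvision

Benchmarking the three variants: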

np_from_pil  28x28x1  n=30000 total=0.1574s avg=0.0052ms
np_from_pil  512x512x3  n=3000 total=2.4441s avg=0.8147ms
pt_from_pil  28x28x1  n=30000 total=0.3200s avg=0.0107ms
pt_from_pil  512x512x3  n=3000 total=2.4826s avg=0.8275ms
tg_from_pil  28x28x1  n=30000 total=4.5150s avg=0.1505ms
tg_from_pil  512x512x3  n=3000 total=3.4306s avg=1.1435ms

Summary (avg ms, speedup vs tinygrad):
28x28x1      pt=0.0107ms tg=0.1527ms pt_speedup_vs_tg=14.29x
512x512x3    pt=0.8251ms tg=1.1484ms pt_speedup_vs_tg=1.39x

That’s ~14x slower for small images.

The rest of the gap probably comes from my naive default_collate and DataLoader implementations, compared to PyTorch’s.
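For context, the collate step looked roughly like this (a sketch of the idea, not the exact code; it assumes each field is already a per-sample Tensor):

from tinygrad import Tensor

def default_collate(samples: list[dict]) -> dict:
    # a Python loop over every sample per field, plus a stack per key:
    # plenty of interpreter overhead when the items are tiny 28x28 images
    return {k: Tensor.stack(*[s[k] for s in samples]) for k in samples[0]}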

It reminded me that tinygrad still has a lot of room for improvement. George said they’ll focus more on cool stuff such as GPT-2 speedruns once the AMD contract is done.

Oh boy, was I wrong.

I did some exploring with opencode, and the main bottleneck turned out to be Python overhead, so I tried to optimize by doing less work in the Python layer.

Simply stacking the samples into one np.ndarray per batch and then converting that to a single Tensor is a lot faster:

def __iter__(self):
    for i in range(len(self)):  # len(self) is the number of batches per epoch
        start = i * self.batch_size
        batch = self.data[start : start + self.batch_size]
        if self.transform:
            batch = self.transform(batch)
        yield batch

def transforms(batch: dict[str, np.ndarray]) -> tuple[Tensor, Tensor]:
    x, y = "image", "label"
    return Tensor(batch[x]).reshape(-1, 28 * 28), Tensor(batch[y])

After running training again, we went from ~13 it/s to:

Epoch:0 - Train Loss: 0.816: 100%|████████| 468/468 [00:04<00:00, 100.21it/s]
Epoch:0 - Valid Loss: 0.678: 100%|████████| 78/78 [00:00<00:00, 132.05it/s]

That’s ~100 it/s, within an acceptable margin of PyTorch’s ~250 it/s.

Then I ran a slightly modified version (to match the model, batch size, and optimizer) of examples/beautiful_mnist.py from the tinygrad repo.

test_accuracy: 96.43%: 100%|████████| 7000/7000 [00:03<00:00, 2109.64it/s]

But it doesn’t use the DataLoader; it loads data directly from the dataset:

X_train, Y_train, X_test, Y_test = mnist(fashion=getenv("FASHION"))
X_train = X_train.reshape(-1, 28 * 28) # To match my model
X_test = X_test.reshape(-1, 28 * 28) # To match my model

So the ~20x difference comes from the data being loaded directly into memory as Tensors. Let’s implement that as well.

@TinyJit
def sample(self) -> tuple[Tensor, ...]:
    # This will miss some data samples due to the randomness
    samples = Tensor.randint(self.batch_size, high=self.data_len)
    return tuple(col[samples] for col in self.data)

def __iter__(self) -> Iterator[tuple[Tensor, ...]]:
    for _ in range(len(self)):
        yield self.sample()

We add a flag to the DataLoader, in_memory=True. This is nice because it lets us use @TinyJit for sampling.
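With the earlier training script, usage would look like this (assuming the flag is plumbed through from_dd):

dls = DataLoaders.from_dd(tds, BATCH_SIZE, in_memory=True)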

Epoch:0 - Train Loss: 0.682: 100%|██████████| 468/468 [00:01<00:00, 346.74it/s]
Epoch:0 - Valid Loss: 0.538: 100%|██████████| 78/78 [00:00<00:00, 629.32it/s]
Epoch:1 - Train Loss: 0.527: 100%|██████████| 468/468 [00:00<00:00, 2714.51it/s]
Epoch:1 - Valid Loss: 0.690: 100%|██████████| 78/78 [00:00<00:00, 2731.15it/s]

Now it’s fast and still using only one CPU.

The issue with in_memory sampling is that it will miss some datapoints: Tensor.randint samples with replacement, so an epoch isn’t a proper permutation of the data. That’s currently what I’m working on.
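A quick illustration of the problem (the printed indices are just one possible outcome):

from tinygrad import Tensor

idx = Tensor.randint(8, high=8)
print(idx.numpy())  # e.g. [3 3 7 0 2 2 5 1] -- duplicates mean other samples get skipped that epoch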

You can see the project here.