The weird case of MNIST

MNIST is simple, clean, and easy to handle, which makes it perfect for beginners. But here's the thing: how often do we download it manually?

The Convenience of Python Libraries

For most of us, especially Python users, the answer is almost never. Thanks to tools like torchvision and tensorflow_datasets, grabbing MNIST is just a few lines of code away. A simple torchvision.datasets.MNIST call handles everything: downloading, caching, and loading the data.
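To make "a few lines" concrete, here is a minimal sketch of that call. The wrapper name load_mnist and the "data" cache directory are my own choices for illustration; only torchvision.datasets.MNIST and its root, train, and download arguments come from the library.

```python
def load_mnist(root="data", train=True):
    """Return one MNIST split, downloading it into `root` on first use."""
    # Imported lazily so this file can be imported without torchvision installed.
    from torchvision import datasets

    # download=True fetches the files only if they are not already under `root`.
    return datasets.MNIST(root=root, train=train, download=True)
```

Calling load_mnist() returns the training split, and load_mnist(train=False) returns the test split. Without a transform, each item is a (PIL image, label) pair.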

But what if you're working with C/C++ or another lower-level language where such luxuries don't exist? You'd naturally head to Google and search for "MNIST dataset download."

The Download Roadblock

The first hit usually leads you to the official MNIST website. It looks promising. You find the download links, click them eagerly... and—wait. 403 Forbidden. The server rejects you.

Confused? You're not alone.

The question arises: why can't you manually download one of the most famous datasets in the world? And more importantly, who can?

The Strange Fix

After scouring Stack Overflow and GitHub issues, you discover a workaround:

curl -A "Mozilla/5.0" -O https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

The -A flag sets the User-Agent header so the request looks like it comes from a browser, which is meant to get past the server's restrictions.

Except... it doesn’t always work. You might still get that frustrating 403. The problem seems to have worsened with recent server updates.
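If you can't shell out to curl, the same User-Agent trick can be sketched with Python's standard library. This is only an illustration: build_request and try_download are hypothetical helper names, the URL is the one from the curl command, and the server may well answer with the same 403.

```python
import urllib.error
import urllib.request

# URL taken from the curl command above.
MNIST_URL = "https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def try_download(url, dest, timeout=30):
    """Save `url` to `dest`; return True on success, False if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            data = resp.read()
    except urllib.error.URLError as err:
        # HTTPError (for example the 403) is a subclass of URLError.
        print(f"download failed: {err}")
        return False
    with open(dest, "wb") as f:
        f.write(data)
    return True
```

try_download(MNIST_URL, "train-images-idx3-ubyte.gz") returns False instead of raising when the server refuses, so the caller can decide what to try next.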

How Python Libraries Bypass This

So how do libraries like torchvision manage to fetch MNIST without any hiccups?

They don't rely on Yann LeCun's server alone.

Torchvision, for instance, keeps a list of mirrors. If one source rejects the request, the library simply tries the next. Here's a mirror entry from torchvision's source (excerpt):

mirrors = ["https://ossci-datasets.s3.amazonaws.com/mnist"]

These mirrors are trusted and maintained to ensure seamless downloads.
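The fallback idea is easy to reproduce. The sketch below is not torchvision's actual code, just one way to write "try each mirror in order" with the standard library. The first mirror is the one quoted above; putting the official site second, and the names fetch and download_with_fallback, are assumptions made for the example.

```python
import urllib.error
import urllib.request

# First entry is the mirror quoted above; the second is the official site.
MIRRORS = [
    "https://ossci-datasets.s3.amazonaws.com/mnist",
    "https://yann.lecun.com/exdb/mnist",
]


def fetch(url, timeout=30):
    """Download `url` and return its bytes (raises URLError on failure)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_with_fallback(filename, mirrors=MIRRORS, fetcher=fetch):
    """Try each mirror in order and return the first successful payload."""
    errors = []
    for base in mirrors:
        url = f"{base.rstrip('/')}/{filename}"
        try:
            return fetcher(url)
        except urllib.error.URLError as err:
            # Remember why this mirror failed, then move on to the next one.
            errors.append(f"{url}: {err}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A call such as download_with_fallback("train-images-idx3-ubyte.gz") returns the raw gzip bytes from the first mirror that answers. The fetcher parameter exists so the network call can be swapped out in tests.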

A Call for Change

While mirrors are a clever workaround, the root issue remains: the official MNIST site still turns away direct downloads.

Access to MNIST should be simple. After all, it's foundational to machine learning education.

Final Word

I'm currently developing an MNIST loader library for C/C++ (and CUDA), similar to torchvision.datasets. If you're interested, feel free to contribute to the repo. Let's make MNIST accessible for everyone, across all languages.