The weird case of MNIST
MNIST is simple, clean, and easy to handle, which makes it perfect for beginners. But here's the thing: how often do we download it manually?
For most of us, especially Python users, the answer is almost never. Thanks to tools like torchvision and tensorflow_datasets, grabbing MNIST is just a few lines of code away. A simple torchvision.datasets.MNIST call with download=True handles everything: downloading, caching, and loading the data.
But what if you're working with C/C++ or another lower-level language where such luxuries don't exist? You'd naturally head to Google and search for "MNIST dataset download."
The first hit usually leads you to the official MNIST website. It looks promising. You find the download links, click them eagerly... and—wait. 403 Forbidden. The server rejects you.
Confused? You're not alone.
The question arises: why can't you manually download one of the most famous datasets in the world? And more importantly, who can?
After scouring Stack Overflow and GitHub issues, you discover a workaround:
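curl -A "Mozilla/5.0" -O https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
This sends a browser-like User-Agent header so the request looks like it came from a browser, which can get past the server's filtering. The -O flag saves the archive under its remote file name instead of dumping binary data to the terminal.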
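If you'd rather do this from a program than from the shell, the same trick can be sketched with Python's standard library. Treat this as a rough illustration, not a guaranteed fix: build_request and download are hypothetical helper names I made up for this example, and the server may still answer with a 403.

```python
import urllib.request

# Same URL as in the curl command.
MNIST_URL = "https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, timeout=30):
    """Fetch url with the spoofed header and write the body to dest.

    urlopen raises urllib.error.HTTPError if the server still refuses
    (for example with 403 Forbidden), so callers should be ready for that.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    return len(data)
```

Calling download(MNIST_URL, "train-images-idx3-ubyte.gz") would attempt the same request as the curl line.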
Except... it doesn’t always work. You might still get that frustrating 403. The problem seems to have worsened with recent server updates.
So how do libraries like torchvision manage to fetch MNIST without any hiccups?
They don't rely on Yann LeCun's server alone.
Torchvision, for instance, keeps a list of mirrors. If one source rejects the request, the library simply tries the next. Here's an excerpt from torchvision's MNIST dataset class:
mirrors = ["https://ossci-datasets.s3.amazonaws.com/mnist"]
These mirrors are trusted and maintained to ensure seamless downloads.
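To make the idea concrete, here is a minimal sketch of mirror fallback in plain Python. It is not torchvision's actual implementation. The names MIRRORS, fetch_url, and download_with_fallback are hypothetical, and the second entry in the list (the official site) and its position are my assumptions; only the first URL comes from the snippet above.

```python
import urllib.request

MIRRORS = [
    "https://ossci-datasets.s3.amazonaws.com/mnist",  # from the torchvision snippet
    "https://yann.lecun.com/exdb/mnist",              # assumption: official site as a second source
]


def fetch_url(url, timeout=30):
    """Download url and return its raw bytes; raises OSError subclasses on failure."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_with_fallback(filename, mirrors=MIRRORS, fetch=fetch_url):
    """Try each mirror in order and return (url, data) from the first that works."""
    errors = []
    for base in mirrors:
        url = f"{base.rstrip('/')}/{filename}"
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError and HTTPError (e.g. 403) are OSError subclasses
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A call such as download_with_fallback("train-images-idx3-ubyte.gz") walks the list until one source answers. Passing fetch as a parameter keeps the retry logic testable without touching the network.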
While mirrors are a clever workaround, the root issue remains: the official MNIST site still rejects direct downloads.
Access to MNIST should be simple. After all, it's foundational to machine learning education.
I'm currently developing an MNIST-like library for C/C++ (and CUDA), similar to torchvision.datasets. If you're interested, feel free to contribute to the repo. Let's make MNIST accessible for everyone, across all languages.