MNIST Classifier in Pure x86-64 Assembly
Sole developer · Personal project, iterated over several weeks
systems · ml · assembly
96.5% test accuracy on MNIST (9,652/10,000) · 27 KB training binary with sub-millisecond inference
Problem / Context
I wanted to understand what a forward pass, backpropagation step, and SGD update actually look like when you write every instruction yourself. The target was x86-64 NASM on Linux, using only syscalls and floating-point registers.
Approach
- Built a 784 → 32 → 10 dense network (sigmoid hidden layer, softmax output) that trains on the full 60,000-sample MNIST dataset and evaluates per-digit accuracy on the 10,000-sample test set.
- Implemented exp and ln as leaf functions using range reduction and direct IEEE 754 bit manipulation. Roughly 14 digits of precision with zero call overhead in the hot path.
- Placed all weights, gradient accumulators, and training data in BSS at fixed addresses. The entire ~445 MB data footprint is predictable and sequential.
- Used Xavier initialization, Fisher-Yates shuffling per epoch, mini-batch gradient accumulation (batch size 32), and numerically stable softmax.
- Followed the SysV AMD64 ABI throughout: callee-saved registers preserved, 16-byte stack alignment before calls, integer arguments in rdi/rsi/rdx and floating-point arguments in xmm0–xmm7. This was a deliberate bet on interoperability.
Results
- 96.5% test accuracy after 30 epochs, competitive with equivalent architectures in high-level frameworks. Training finishes in 60–135 seconds depending on hardware.
- A companion C HTTP server links five NASM object files for the inference path with no wrappers or marshalling needed. A web app lets you draw digits on a canvas and get predictions in under a millisecond.
- The training binary is 27 KB.
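The no-wrapper interop works because both sides agree on the calling convention. A sketch of what the C side might look like; the entry-point name and signature here are illustrative, not the project's actual interface:

```c
/* Illustrative only: the real entry-point name is not given above.
   Under the SysV AMD64 ABI, this extern declaration is the entire
   "FFI layer" -- pointer arguments arrive in rdi/rsi whether the
   callee was compiled from C or assembled with NASM, so the object
   files link directly, e.g.:
       cc server.c forward.o softmax.o ... -o server           */
extern void predict_digit(const float *pixels, float *probs);
```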
What I’d Do Next
- Vectorize dot products with AVX2 to measure the real-world speedup in the training loop.
- Add a convolutional layer and see what convolution looks like in handwritten assembly.
- Experiment with int8 weight quantization on the inference path to test whether accuracy holds with a smaller weight file.