MNIST Classifier in Pure x86-64 Assembly
Sole developer · Personal project, iterated over several weeks
systems · ml · assembly
96.5% test accuracy on MNIST (9,652/10,000) · 27 KB training binary with sub-millisecond inference
Problem / Context
I wanted to understand what a forward pass, backpropagation step, and SGD update actually look like when you write every instruction yourself. The target was x86-64 NASM on Linux, using only syscalls and floating-point registers.
Approach
- Built a 784 → 32 → 10 dense network (sigmoid hidden layer, softmax output) that trains on the full 60,000-sample MNIST dataset and evaluates per-digit accuracy on the 10,000-sample test set.
- Implemented exp and ln as leaf functions using range reduction and direct IEEE 754 bit manipulation. Roughly 14 digits of precision with zero call overhead in the hot path.
- Placed all weights, gradient accumulators, and training data in BSS at fixed addresses. The entire ~445 MB data footprint is predictable and sequential.
- Used Xavier initialization, Fisher-Yates shuffling per epoch, mini-batch gradient accumulation (batch size 32), and numerically stable softmax.
- Followed the SysV AMD64 ABI throughout: callee-saved registers preserved, 16-byte stack alignment before calls, integer arguments in rdi/rsi/rdx and floating-point arguments in xmm0–xmm7. This was a deliberate bet on interoperability.
Results
- 96.5% test accuracy after 30 epochs, competitive with equivalent architectures in high-level frameworks. Training finishes in 60–135 seconds depending on hardware.
- A companion C HTTP server links five NASM object files for the inference path with no wrappers or marshalling needed. A web app lets you draw digits on a canvas and get predictions in under a millisecond.
- The training binary is 27 KB.
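The no-wrapper interop works because both sides agree on the calling convention. A sketch of what the C side might look like; the entry-point name and signature here are illustrative, not the project's actual interface:

```c
/* Illustrative only: the real entry-point name is not given above.
   Under the SysV AMD64 ABI, this extern declaration is the entire
   "FFI layer" -- pointer arguments arrive in rdi/rsi whether the
   callee was compiled from C or assembled with NASM, so the object
   files link directly, e.g.:
       cc server.c forward.o softmax.o ... -o server           */
extern void predict_digit(const float *pixels, float *probs);
```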
What I’d Do Next
- Vectorize dot products with AVX2 to measure the real-world speedup in the training loop.
- Add a convolutional layer and see what convolution looks like in handwritten assembly.
- Experiment with int8 weight quantization on the inference path to test whether accuracy holds with a smaller weight file.