asmqdm: Python Progress Bars Created in x86-64 Assembly
Sole developer · Personal project
systems · assembly · performance · python
~5ns per update in async mode via LOCK XADD
Render thread on a separate core via clone() and sched_setaffinity
Problem / Context
Progress bars in Python add per-iteration overhead that can matter in tight loops. I wanted to see how low that overhead could go if the update path were a single atomic instruction in assembly, with rendering handled by a separate thread on a different CPU core.
Approach
- Implemented the core in x86-64 NASM as a shared library (libasmqdm.so). All I/O, memory allocation, timing, and thread management use raw Linux syscalls.
- Built two modes: sync (renders inline with a 50ms throttle) and async (spawns a render thread via clone() that polls at ~60fps while the main thread updates a counter with LOCK XADD).
- In async mode, the render thread is pinned to a different core than Python using sched_setaffinity, so the two never contend for the same CPU.
- Wrapped the assembly API in a Python package using ctypes. The Python layer handles iterator protocol, context manager, and argument validation. The assembly layer handles everything that touches the terminal or the clock.
- All arithmetic is integer. Time is tracked in nanoseconds, rates in iterations per second, percentages via integer division. No floating-point state to save or restore.
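
The integer-only arithmetic described in the last point can be illustrated in Python. This is a minimal sketch of the idea, not the assembly implementation: the function names, the clamping to 100, and the handling of a zero total or zero elapsed time are assumptions made for the example.

```python
NS_PER_SEC = 1_000_000_000


def percent_done(count: int, total: int) -> int:
    """Whole-number percentage via integer division (assumed: clamp to 100)."""
    if total <= 0:
        # Assumption: an unknown or empty total reports 0%.
        return 0
    return min(count * 100 // total, 100)


def rate_per_sec(count: int, elapsed_ns: int) -> int:
    """Iterations per second computed from a nanosecond duration."""
    if elapsed_ns <= 0:
        # Assumption: no elapsed time yet means no rate to report.
        return 0
    # Multiply before dividing so the fractional part is not lost early.
    return count * NS_PER_SEC // elapsed_ns


if __name__ == "__main__":
    # 1 of 3 items done; 500 iterations in 250 ms.
    print(percent_done(1, 3), rate_per_sec(500, 250_000_000))  # prints: 33 2000
```

Python integers are arbitrary precision, so this sketch never overflows. A 64-bit implementation would have to consider the size of the intermediate product count * 1_000_000_000.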
Results
- Async update overhead is roughly 5-10 nanoseconds per call, dominated by the LOCK XADD instruction. Sync mode runs at ~500-1000ns with the render throttle.
- The Python API is compatible with tqdm’s iterator and context manager patterns, so switching is a one-line change.
- Each progress bar allocates ~650 bytes (state + render buffer) via mmap. Async mode adds a ~65KB thread stack.
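
The tqdm-compatible surface mentioned above (iterator protocol plus context manager) could look roughly like the sketch below. It is an assumption-heavy illustration: the class names Bar and CounterBackend are hypothetical, and the backend is a pure-Python stand-in for the ctypes calls into libasmqdm.so, whose exported function names are not given in this write-up.

```python
class CounterBackend:
    """Hypothetical stand-in for the native library: it only keeps a count."""

    def __init__(self, total: int):
        self.total = total
        self.count = 0
        self.closed = False

    def update(self, n: int) -> None:
        self.count += n

    def close(self) -> None:
        self.closed = True


class Bar:
    """tqdm-style wrapper: usable as an iterator or as a context manager."""

    def __init__(self, iterable=None, total=None, backend_factory=CounterBackend):
        if total is None and iterable is not None:
            try:
                total = len(iterable)
            except TypeError:
                total = 0  # generators and other unsized iterables
        if total is None:
            total = 0
        if total < 0:
            raise ValueError("total must be non-negative")
        self._iterable = iterable
        self.backend = backend_factory(total)

    def update(self, n: int = 1) -> None:
        self.backend.update(n)

    def close(self) -> None:
        self.backend.close()

    def __iter__(self):
        if self._iterable is None:
            raise TypeError("Bar was created without an iterable")
        try:
            for item in self._iterable:
                yield item
                self.update(1)  # count the item once the loop body has run
        finally:
            self.close()  # release the backend even if the loop exits early

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions from the with-block


if __name__ == "__main__":
    for _ in Bar(range(5)):      # iterator pattern
        pass
    with Bar(total=3) as bar:    # context manager pattern
        bar.update(2)
        bar.update()
```

Keeping validation and protocol handling in Python, as the Approach section describes, means the native layer only ever sees a validated total and plain update and close calls.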
What I’d Do Next
- Add nested progress bar support with cursor positioning.
- Benchmark against tqdm on real workloads (data loading, model training) to measure the practical gap.
- Port the async mode to io_uring for render writes to further reduce syscall overhead.