TinyInfer-WASM
A Rust-to-WASM neural network inference engine that runs entirely in the browser: MobileNetV2 inference in 45ms (vs. 120ms for ONNX.js and 180ms for TF.js), a 113KB binary, and 94% test coverage.
// DESCRIPTION
The Problem: Browser ML Is Stuck Between Privacy Risk and Slow JS
Running machine learning inference directly in the browser is ideal for privacy-sensitive applications: no user data ever leaves the device, no server round-trips add latency, and the application works offline. The problem is that existing solutions are unsatisfying. Server-side inference sends private data over the network and introduces round-trip latency of 50-500ms. JavaScript-native ML frameworks like ONNX.js and TensorFlow.js work in-browser but are 3-4x slower than native inference — the result of JavaScript's dynamic typing, garbage collector pauses, and inability to exploit hardware SIMD lanes as efficiently as compiled code. WebAssembly changes the equation: compiled from Rust, WASM can access SIMD instructions and run at near-native speed, but building an ML inference engine in Rust targeting WASM requires careful attention to memory layout, operator fusion, and the unique constraints of the browser sandbox.
Innovation: Hand-Crafted SIMD MatMul, Operator Fusion, INT8 Quantization
TinyInfer-WASM implements 17+ neural network operators in Rust compiled to WebAssembly. Three key performance innovations:
Hand-crafted SIMD MatMul with tile-based blocking: The core matrix multiplication kernel uses explicit WASM SIMD intrinsics (128-bit vector lanes) with cache-friendly tile-based blocking to maximize L1/L2 cache utilization. Matrix multiplications account for roughly 90% of inference time in transformer and CNN models.
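The blocking scheme can be sketched as below. This is a minimal, portable illustration under assumed names (`matmul_tiled`, `TILE`); the actual engine's kernel would replace the innermost loop with `core::arch::wasm32` f32x4 intrinsics, which only compile for the wasm32 target.

```rust
/// Tiled matrix multiplication: C (m x n) += A (m x k) * B (k x n), row-major.
/// Sketch of the cache-blocking structure only; the real WASM kernel vectorizes
/// the innermost j-loop with 128-bit f32x4 SIMD lanes.
const TILE: usize = 32; // chosen so an A-tile and B-tile together fit in L1

pub fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    // Walk the matrices tile by tile so each block of A and B stays
    // cache-resident while it is reused across the inner loops.
    for i0 in (0..m).step_by(TILE) {
        for p0 in (0..k).step_by(TILE) {
            for j0 in (0..n).step_by(TILE) {
                for i in i0..(i0 + TILE).min(m) {
                    for p in p0..(p0 + TILE).min(k) {
                        let a_ip = a[i * k + p];
                        for j in j0..(j0 + TILE).min(n) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Blocking changes the memory-access pattern, not the arithmetic, so results match a naive triple loop exactly.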
Automatic operator fusion (Conv + BN + ReLU): Rather than executing convolution, batch normalization, and activation as three sequential kernel launches, the engine detects fusable operator sequences at graph load time and executes them as a single fused kernel, eliminating intermediate memory allocations.
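Because batch normalization is an affine transform per output channel, it can be folded into the convolution's weights and biases at load time. A minimal sketch of that folding, with assumed names (`fuse_conv_bn`) and layout:

```rust
/// Fold BatchNorm into the preceding convolution at graph-load time, so
/// Conv+BN+ReLU can run as one kernel with no intermediate buffers.
/// Per output channel: s = gamma / sqrt(var + eps); w' = w * s;
/// b' = (b - mean) * s + beta.
pub fn fuse_conv_bn(
    weights: &mut [f32], // [out_ch, in_ch * kh * kw], row-major
    bias: &mut [f32],    // [out_ch]
    gamma: &[f32],
    beta: &[f32],
    mean: &[f32],
    var: &[f32],
    eps: f32,
) {
    let out_ch = bias.len();
    let per_ch = weights.len() / out_ch;
    for oc in 0..out_ch {
        let s = gamma[oc] / (var[oc] + eps).sqrt();
        for w in &mut weights[oc * per_ch..(oc + 1) * per_ch] {
            *w *= s;
        }
        bias[oc] = (bias[oc] - mean[oc]) * s + beta[oc];
    }
}

/// The ReLU is then applied in the same pass over the fused conv output.
#[inline]
pub fn relu(x: f32) -> f32 {
    x.max(0.0)
}
```

After folding, the BN node disappears from the graph entirely; only the rescaled convolution plus an in-pass ReLU clamp remain.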
INT8 quantization: Post-training quantization to 8-bit integers reduces model size and allows the SIMD kernel to process 4x more values per vector operation compared to 32-bit floats.
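A simple symmetric per-tensor scheme illustrates the idea; the function names here (`quantize`, `dequantize`) are assumptions, not the engine's actual API:

```rust
/// Symmetric post-training INT8 quantization: map f32 values to i8 with one
/// per-tensor scale, so 16 i8 lanes fit in a 128-bit SIMD vector where only
/// 4 f32 lanes did.
pub fn quantize(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recover approximate f32 values; error is bounded by scale / 2 per element.
pub fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Production quantizers typically add per-channel scales and zero-points for asymmetric ranges, but the size and SIMD-throughput benefits are already visible in this symmetric form.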
Web Workers ensure the main UI thread is never blocked. IndexedDB model caching stores downloaded ONNX models client-side. A 4-stage CI/CD pipeline (lint → test → WASM build → deploy) ensures deployability after every commit.
Results: Phase 8 Complete, Deployed at geoffreywtech.me:8080
Inference speed (MobileNetV2, single image):
TinyInfer-WASM: 45ms | ONNX.js: 120ms | TF.js: 180ms — 2.7x speedup over ONNX.js and 4x speedup over TF.js.
Binary size: 113KB WASM — compact enough for fast initial load.
Test coverage: 94% across all 17+ operators via automated CI.
Server memory footprint: 3.8MB — all heavy computation runs client-side.
Live demo: geoffreywtech.me:8080
// HIGHLIGHTS
- Phase 8 complete — live demo at geoffreywtech.me:8080
- MobileNetV2 in 45ms vs ONNX.js 120ms vs TF.js 180ms — 2.7-4x speedup in-browser
- Hand-crafted WASM SIMD MatMul with tile-based blocking as core kernel
- Automatic operator fusion (Conv+BN+ReLU) eliminates intermediate memory roundtrips
- INT8 quantization — 4x more values per 128-bit SIMD vector vs FP32
- 113KB WASM binary with 94% test coverage across 17+ operators
- Web Workers + IndexedDB model caching — non-blocking UI, offline-capable
- 3.8MB server memory — all heavy compute runs client-side in browser