ONNX Model Quantization on Mobile: What Actually Works (and What Crashes)
The Problem
We needed to ship 525MB of ONNX models (six related models) on both iOS and Android. All inference runs on CPU via ONNX Runtime 1.22.0. Smaller files mean faster downloads, less disk usage, and faster cold-start loading.
ONNX Runtime's documentation describes several quantization approaches. We tried four. Two crashed on mobile. One gave 50% reduction. One gave 75% reduction with zero quality loss.
The Environment
| Component | Detail |
|---|---|
| Models | 6 ONNX models, 60-115MB each, 525MB total |
| Architecture | Conv1d, ConvTranspose1d, Linear layers (generative model) |
| Runtime | ONNX Runtime 1.22.0, CPU Execution Provider |
| Platforms | Android (ARM64), iOS (ARM64) |
| Goal | Maximum size reduction, zero quality degradation |
Approach 1: Full INT8 via quantize_dynamic() — Crashes
The obvious first attempt. ONNX Runtime provides quantize_dynamic() which converts FP32 weights to INT8 and inserts MatMulInteger / ConvInteger nodes:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```
The quantized model loads and runs on x86 desktop. On Android ARM64 with ORT CPU EP, it crashes:
```
onnxruntime::ConvInteger - no registered operator for EP
```
The ARM CPU Execution Provider in ORT 1.22.0 does not support ConvInteger. The model quantizes successfully, produces a valid ONNX file, runs on x86 — and crashes on the actual target platform. No warning during quantization.
Verdict: Dead on arrival for mobile ARM.
Approach 2: FP16 Graph Conversion — Crashes
Convert the entire computation graph to FP16 using onnxconverter_common:
```python
import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
)
onnx.save(model_fp16, "model_fp16.onnx")
```
keep_io_types=True preserves FP32 at graph-level inputs and outputs. But intermediate operations are converted to FP16. ORT 1.22.0 CPU EP on ARM cannot execute FP16 intermediate operations:
```
Type mismatch: input 'X' expects float but got float16
```
Verdict: Fundamentally incompatible with CPU EP.
Approach 3: FP16 Weight-Only — Works (50%)
Store weights at half precision on disk, insert Cast(to=FLOAT) nodes before each consumer, keep all computation in FP32. This got us a clean 50% reduction (525MB → 266MB) with identical output quality. No exotic operators — Cast is universally supported.
This worked. But we kept going.
Approach 4: INT8 Weight-Only — Works (75%)
Same principle as FP16 weight-only, but one precision level further. Store weight tensors as INT8 on disk with per-tensor scale and zero-point. Insert DequantizeLinear nodes before each consumer to reconstruct FP32 at runtime. All computation remains FP32.
```
INT8 weight on disk → DequantizeLinear → FP32 → normal FP32 Conv/MatMul → FP32 output
```
This is not the same as quantize_dynamic(), which replaces Conv nodes with ConvInteger and performs integer arithmetic. Weight-only quantization keeps the original FP32 operators and only changes the storage format of their weight inputs. The operator graph is untouched.
The Script
```python
import onnx
import numpy as np
from onnx import helper, TensorProto, numpy_helper

def quantize_weights_int8(input_path, output_path, min_size_bytes=1024):
    """
    Convert large weight tensors to INT8 with per-tensor scale/zero-point.
    Insert DequantizeLinear nodes to reconstruct FP32 at runtime.
    Skip tiny tensors where overhead exceeds storage savings.
    """
    model = onnx.load(input_path)
    graph = model.graph
    converted = 0
    skipped = 0
    for init in list(graph.initializer):
        if init.data_type != TensorProto.FLOAT:
            continue
        arr = numpy_helper.to_array(init)
        if arr.nbytes < min_size_bytes:
            skipped += 1
            continue
        # Per-tensor symmetric quantization
        abs_max = np.max(np.abs(arr))
        if abs_max == 0:
            skipped += 1
            continue
        scale = abs_max / 127.0
        zero_point = np.int8(0)
        arr_int8 = np.clip(np.round(arr / scale), -127, 127).astype(np.int8)
        # Create INT8 initializer
        int8_name = init.name + "_int8"
        int8_init = numpy_helper.from_array(arr_int8, name=int8_name)
        graph.initializer.append(int8_init)
        # Create scale initializer
        scale_name = init.name + "_scale"
        scale_init = numpy_helper.from_array(
            np.array(scale, dtype=np.float32), name=scale_name)
        graph.initializer.append(scale_init)
        # Create zero-point initializer
        zp_name = init.name + "_zp"
        zp_init = numpy_helper.from_array(
            np.array(zero_point, dtype=np.int8), name=zp_name)
        graph.initializer.append(zp_init)
        # Create DequantizeLinear node
        deq_output = init.name + "_deq"
        deq_node = helper.make_node(
            "DequantizeLinear",
            inputs=[int8_name, scale_name, zp_name],
            outputs=[deq_output],
        )
        graph.node.insert(0, deq_node)
        # Rewire consumers
        for node in graph.node:
            for j, inp in enumerate(node.input):
                if inp == init.name:
                    node.input[j] = deq_output
        graph.initializer.remove(init)
        converted += 1
    print(f"Converted: {converted}, Skipped: {skipped}")
    onnx.save(model, output_path)

quantize_weights_int8("model.onnx", "model_int8w.onnx")
```
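To confirm the on-disk savings after running the script, a small helper (the function name is ours; paths are placeholders):

```python
import os

def size_reduction_pct(original_path: str, quantized_path: str) -> float:
    """Percent size reduction of the quantized file relative to the original."""
    before = os.path.getsize(original_path)
    after = os.path.getsize(quantized_path)
    return 100.0 * (1.0 - after / before)
```

With the tiny-tensor threshold in place, the reduction lands slightly below the theoretical 75% because small FP32 tensors are left untouched.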
Why Skip Tiny Tensors?
A 512-byte bias tensor saves 384 bytes by converting to INT8. But the DequantizeLinear node plus the scale and zero-point initializers add overhead to the graph. For tensors under 1KB, the savings aren't worth the graph complexity. The min_size_bytes=1024 threshold catches biases and small embeddings while converting all the large weight matrices that dominate file size.
Results
| Model | FP32 | FP16 Weight-Only | INT8 Weight-Only | % of Original |
|---|---|---|---|---|
| Model A (multi-lang) | 115 MB | 58 MB | 28 MB | 24.4% |
| Model B | 60 MB | 31 MB | 15 MB | 24.8% |
| Model C | 64 MB | 32 MB | 16 MB | 25.1% |
| Model D | 111 MB | 56 MB | 28 MB | 25.2% |
| Model E | 64 MB | 32 MB | 16 MB | 25.1% |
| Model F | 111 MB | 56 MB | 28 MB | 25.2% |
| Total | 525 MB | 266 MB | 131 MB | ~25% |
Quality
Perceptually identical to FP32 on all tested inputs. The per-tensor symmetric quantization preserves the weight distribution well enough that the FP32 reconstruction via DequantizeLinear introduces no audible or measurable degradation. Validated through automated A/B generation and manual testing on-device.
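The low measured error has a simple numerical basis: per-tensor symmetric INT8 reconstruction error is bounded by half a quantization step, i.e. abs_max / 254 per tensor. A numpy sketch of the round trip performed by the script above (function name is ours; assumes the tensor is not all zeros, which the script skips anyway):

```python
import numpy as np

def int8_roundtrip_error(arr: np.ndarray) -> tuple:
    """Quantize to INT8 and reconstruct, mirroring the conversion script.
    Returns (max abs reconstruction error, half quantization step)."""
    scale = np.max(np.abs(arr)) / 127.0
    q = np.clip(np.round(arr / scale), -127, 127).astype(np.int8)
    recon = q.astype(np.float32) * scale
    return float(np.max(np.abs(arr - recon))), float(scale / 2.0)
```

Whether that worst-case error is perceptually relevant depends on the model, which is why we still validated with A/B generation on-device rather than trusting the bound alone.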
Compatibility
| Platform | Status |
|---|---|
| Android ARM64 (ORT 1.22.0 CPU EP) | Works |
| iOS ARM64 (onnxruntime-objc) | Works |
| x86-64 Linux (ORT 1.22.0) | Works |
| x86-64 macOS | Works |
Why This Isn't in the Docs
ONNX Runtime's quantization documentation focuses on two paths: quantize_dynamic() (INT8 with integer arithmetic operators) and quantize_static() (INT8 with calibration). Both produce models with ConvInteger, MatMulInteger, or QLinearConv nodes. These work on x86 with AVX/VNNI and on GPUs with Tensor Cores.
On mobile ARM CPUs, integer arithmetic operators have limited or no support in the CPU EP. The documentation doesn't clearly state this. You discover it at runtime when the model crashes on a real device.
The weight-only approach isn't "quantization" in the traditional sense — it's a storage optimization. No precision is lost during the actual computation. The ONNX ecosystem doesn't have a standard tool for it because it's a 50-line Python script, not a research contribution. But it's the only approach that actually ships on mobile.
The Failure Mode Map
| Approach | x86 CPU | ARM CPU (Mobile) | Size Reduction | Quality |
|---|---|---|---|---|
| Full INT8 (quantize_dynamic) | Works | Crashes | ~67% | Slight loss |
| FP16 graph conversion | Crashes | Crashes | ~50% | Slight loss |
| FP16 weight-only + Cast | Works | Works | ~50% | Identical |
| INT8 weight-only + DequantizeLinear | Works | Works | ~75% | Identical |
Start with INT8 weight-only. If you notice quality degradation on your specific model, fall back to FP16 weight-only. In our case across six models of varying sizes and architectures, INT8 weight-only produced identical output to FP32.
One Gotcha: OrtEnvironment Singleton
If you're loading multiple ONNX models in the same app, never call .close() on the OrtEnvironment on Android or ORTEnv on iOS. It's a process-level singleton. Closing it kills all other ORT sessions in the app, including ones you didn't intend to affect. Initialize it once at app startup, keep it alive forever.
```kotlin
// Android (Kotlin) — do this ONCE
val env = OrtEnvironment.getEnvironment()
// Never call env.close()
```

```swift
// iOS (Swift) — do this ONCE
let env = try ORTEnv(loggingLevel: .warning)
// Never release env
```