
ONNX Model Quantization on Mobile: What Actually Works (and What Crashes)

The Problem

We needed to ship 525MB of ONNX models (six related models) on both iOS and Android. All inference runs on CPU via ONNX Runtime 1.22.0. Smaller files mean faster downloads, less disk usage, and faster cold-start loading.

ONNX Runtime's documentation describes several quantization approaches. We tried four. Two crashed on mobile. One gave 50% reduction. One gave 75% reduction with zero quality loss.

The Environment

Component     Detail
Models        6 ONNX models, 19-115MB each, 525MB total
Architecture  Conv1d, ConvTranspose1d, Linear layers (generative model)
Runtime       ONNX Runtime 1.22.0, CPU Execution Provider
Platforms     Android (ARM64), iOS (ARM64)
Goal          Maximum size reduction, zero quality degradation

Approach 1: Full INT8 via quantize_dynamic() — Crashes

The obvious first attempt. ONNX Runtime provides quantize_dynamic() which converts FP32 weights to INT8 and inserts MatMulInteger / ConvInteger nodes:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)

The quantized model loads and runs on x86 desktop. On Android ARM64 with ORT CPU EP, it crashes:

onnxruntime::ConvInteger - no registered operator for EP

The ARM CPU Execution Provider in ORT 1.22.0 does not support ConvInteger. The model quantizes successfully, produces a valid ONNX file, runs on x86 — and crashes on the actual target platform. No warning during quantization.
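Since the quantization step itself won't flag this, one way to catch it before a device crash is to scan the exported graph for integer-arithmetic operators. A minimal sketch; the op list is our assumption based on the operators discussed in this post, not an exhaustive inventory of EP support:

import onnx

# Integer-arithmetic ops that the mobile ARM CPU EP may not implement
# (assumed list, based on the operators named in this post).
SUSPECT_OPS = {"ConvInteger", "MatMulInteger", "QLinearConv"}

def find_suspect_ops(path):
    model = onnx.load(path)
    return sorted({n.op_type for n in model.graph.node if n.op_type in SUSPECT_OPS})

print(find_suspect_ops("model_int8.onnx"))  # e.g. ['ConvInteger', 'MatMulInteger']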

Verdict: Dead on arrival for mobile ARM.

Approach 2: FP16 Graph Conversion — Crashes

Convert the entire computation graph to FP16 using onnxconverter_common:

import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True
)
onnx.save(model_fp16, "model_fp16.onnx")

keep_io_types=True preserves FP32 at graph-level inputs and outputs. But intermediate operations are converted to FP16. ORT 1.22.0 CPU EP on ARM cannot execute FP16 intermediate operations:

Type mismatch: input 'X' expects float but got float16

Verdict: Fundamentally incompatible with CPU EP.

Approach 3: FP16 Weight-Only — Works (50%)

Store weights at half precision on disk, insert Cast(to=FLOAT) nodes before each consumer, keep all computation in FP32. This got us a clean 50% reduction (525MB → 266MB) with identical output quality. No exotic operators — Cast is universally supported.
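A minimal sketch of the mechanics, analogous to the INT8 script later in this post; the function name and size threshold are illustrative, and it inserts one shared Cast per weight rather than one per consumer:

import onnx
import numpy as np
from onnx import helper, TensorProto, numpy_helper

def convert_weights_fp16(input_path, output_path, min_size_bytes=1024):
    """Store large FP32 initializers as FP16 and insert Cast(to=FLOAT)
    nodes so every consumer still sees FP32 at runtime."""
    model = onnx.load(input_path)
    graph = model.graph

    for init in list(graph.initializer):
        if init.data_type != TensorProto.FLOAT:
            continue
        arr = numpy_helper.to_array(init)
        if arr.nbytes < min_size_bytes:
            continue  # not worth the extra Cast node

        # FP16 copy of the weight, stored under a new name
        fp16_name = init.name + "_fp16"
        graph.initializer.append(
            numpy_helper.from_array(arr.astype(np.float16), name=fp16_name))

        # Cast back to FP32 at runtime; all computation stays FP32
        cast_out = init.name + "_fp32"
        graph.node.insert(0, helper.make_node(
            "Cast", inputs=[fp16_name], outputs=[cast_out], to=TensorProto.FLOAT))

        # Point every consumer at the Cast output instead of the old weight
        for node in graph.node:
            for j, inp in enumerate(node.input):
                if inp == init.name:
                    node.input[j] = cast_out

        graph.initializer.remove(init)

    onnx.save(model, output_path)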

This worked. But we kept going.

Approach 4: INT8 Weight-Only — Works (75%)

Same principle as FP16 weight-only, but one precision level further. Store weight tensors as INT8 on disk with per-tensor scale and zero-point. Insert DequantizeLinear nodes before each consumer to reconstruct FP32 at runtime. All computation remains FP32.

INT8 weight on disk → DequantizeLinear → FP32 weight → normal FP32 Conv/MatMul → FP32 output

This is not the same as quantize_dynamic(). That replaces Conv nodes with ConvInteger (integer arithmetic). This keeps the original FP32 operators and only changes the storage format of their weight inputs. The operator graph is untouched.

The Script

import onnx
import numpy as np
from onnx import helper, TensorProto, numpy_helper

def quantize_weights_int8(input_path, output_path, min_size_bytes=1024):
    """
    Convert large weight tensors to INT8 with per-tensor scale/zero-point.
    Insert DequantizeLinear nodes to reconstruct FP32 at runtime.
    Skip tiny tensors where overhead exceeds storage savings.
    """
    model = onnx.load(input_path)
    graph = model.graph
    converted = 0
    skipped = 0

    for init in list(graph.initializer):
        if init.data_type != TensorProto.FLOAT:
            continue

        arr = numpy_helper.to_array(init)
        if arr.nbytes < min_size_bytes:
            skipped += 1
            continue

        # Per-tensor symmetric quantization
        abs_max = np.max(np.abs(arr))
        if abs_max == 0:
            skipped += 1
            continue
        scale = abs_max / 127.0
        zero_point = np.int8(0)
        arr_int8 = np.clip(np.round(arr / scale), -127, 127).astype(np.int8)

        # Create INT8 initializer
        int8_name = init.name + "_int8"
        int8_init = numpy_helper.from_array(arr_int8, name=int8_name)
        graph.initializer.append(int8_init)

        # Create scale initializer
        scale_name = init.name + "_scale"
        scale_init = numpy_helper.from_array(
            np.array(scale, dtype=np.float32), name=scale_name)
        graph.initializer.append(scale_init)

        # Create zero-point initializer
        zp_name = init.name + "_zp"
        zp_init = numpy_helper.from_array(
            np.array(zero_point, dtype=np.int8), name=zp_name)
        graph.initializer.append(zp_init)

        # Create DequantizeLinear node
        deq_output = init.name + "_deq"
        deq_node = helper.make_node(
            "DequantizeLinear",
            inputs=[int8_name, scale_name, zp_name],
            outputs=[deq_output]
        )
        graph.node.insert(0, deq_node)

        # Rewire consumers
        for node in graph.node:
            for j, inp in enumerate(node.input):
                if inp == init.name:
                    node.input[j] = deq_output

        graph.initializer.remove(init)
        converted += 1

    print(f"Converted: {converted}, Skipped: {skipped}")
    onnx.save(model, output_path)

quantize_weights_int8("model.onnx", "model_int8w.onnx")
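Two cheap checks after conversion are worth running before anything goes near a device: onnx.checker validates the rewritten graph, and building an InferenceSession confirms the CPU EP accepts it. A short sketch, using the file names from the script above:

import onnx
import onnxruntime as ort

# Structural validation of the rewritten graph (initializers, node wiring, types)
onnx.checker.check_model("model_int8w.onnx")

# Confirm the CPU EP will actually build a session from the rewritten graph
sess = ort.InferenceSession("model_int8w.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])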

Why Skip Tiny Tensors?

A 512-byte bias tensor saves 384 bytes when converted to INT8, but the DequantizeLinear node plus the scale and zero-point initializers add their own overhead to the graph. For tensors under 1KB, the savings aren't worth the graph complexity. The min_size_bytes=1024 threshold skips biases and small embeddings while still converting the large weight matrices that dominate file size.

Results

Model                   FP32     FP16 Weight-Only   INT8 Weight-Only   % of Original
Model A (multi-lang)    115 MB   58 MB              28 MB              24.4%
Model B                 60 MB    31 MB              15 MB              24.8%
Model C                 64 MB    32 MB              16 MB              25.1%
Model D                 111 MB   56 MB              28 MB              25.2%
Model E                 64 MB    32 MB              16 MB              25.1%
Model F                 111 MB   56 MB              28 MB              25.2%
Total                   525 MB   266 MB             131 MB             ~25%

Quality

Perceptually identical to FP32 on all tested inputs. The per-tensor symmetric quantization preserves the weight distribution well enough that the FP32 reconstruction via DequantizeLinear introduces no audible or measurable degradation. Validated through automated A/B generation and manual testing on-device.
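A sketch of what an automated A/B check can look like; the input name, shape, and iteration count below are illustrative placeholders, not our exact pipeline:

import numpy as np
import onnxruntime as ort

fp32 = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
int8w = ort.InferenceSession("model_int8w.onnx", providers=["CPUExecutionProvider"])

rng = np.random.default_rng(0)
for _ in range(10):
    # Placeholder input name/shape -- substitute your model's actual signature.
    x = rng.standard_normal((1, 1, 16000)).astype(np.float32)
    ref = fp32.run(None, {"input": x})
    out = int8w.run(None, {"input": x})
    diff = max(float(np.max(np.abs(a - b))) for a, b in zip(ref, out))
    print(f"max abs diff: {diff:.6f}")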

Compatibility

Platform                             Status
Android ARM64 (ORT 1.22.0 CPU EP)    Works
iOS ARM64 (onnxruntime-objc)         Works
x86-64 Linux (ORT 1.22.0)            Works
x86-64 macOS                         Works

Why This Isn't in the Docs

ONNX Runtime's quantization documentation focuses on two paths: quantize_dynamic() (INT8 with integer arithmetic operators) and quantize_static() (INT8 with calibration). Both produce models with ConvInteger, MatMulInteger, or QLinearConv nodes. These work on x86 with AVX/VNNI and on GPUs with Tensor Cores.

On mobile ARM CPUs, integer arithmetic operators have limited or no support in the CPU EP. The documentation doesn't clearly state this. You discover it at runtime when the model crashes on a real device.

The weight-only approach isn't "quantization" in the traditional sense — it's a storage optimization. No precision is lost during the actual computation. The ONNX ecosystem doesn't have a standard tool for it because it's a 50-line Python script, not a research contribution. But it's the only approach that actually ships on mobile.

The Failure Mode Map

Approach                              x86 CPU   ARM CPU (Mobile)   Size Reduction   Quality
Full INT8 (quantize_dynamic)          Works     Crashes            ~67%             Slight loss
FP16 graph conversion                 Crashes   Crashes            ~50%             Slight loss
FP16 weight-only + Cast               Works     Works              ~50%             Identical
INT8 weight-only + DequantizeLinear   Works     Works              ~75%             Identical

Start with INT8 weight-only. If you notice quality degradation on your specific model, fall back to FP16 weight-only. In our case across six models of varying sizes and architectures, INT8 weight-only produced identical output to FP32.

One Gotcha: OrtEnvironment Singleton

If you're loading multiple ONNX models in the same app, never call .close() on the OrtEnvironment on Android or ORTEnv on iOS. It's a process-level singleton: closing it kills every other ORT session in the app, including ones you didn't intend to touch. Initialize it once at app startup and keep it alive for the lifetime of the process.

// Android (Kotlin) — do this ONCE
val env = OrtEnvironment.getEnvironment()
// Never call env.close()

// iOS (Swift) — do this ONCE
let env = try ORTEnv(loggingLevel: .warning)
// Never release env