POLYGLOTTOS

Running Silero VAD v6 on iOS with onnxruntime-objc

The Problem

Silero VAD v6 (silero_vad.onnx, 2.2MB, MIT license) loaded and ran on iOS via onnxruntime-objc, but returned speech probabilities near zero on loud, clear audio. maxProb was 0.002 when it should have been 0.95+. The same model on Android and Python worked perfectly.

After a full clinical debugging session, we traced the problem to three compounding root causes — one of which is completely undocumented outside the C++ reference implementation.

The Environment

Component     Detail
Model         silero_vad.onnx v6 (2.2MB, 260K params)
Runtime       onnxruntime-objc (CocoaPods)
Platform      iOS 18, iPhone 16 Pro
Audio         16kHz mono PCM, clear speech, max amplitude 0.97
ONNX inputs   input [1, N], state [2, 1, 128], sr scalar int64
ONNX outputs  output [1, 1], stateN [2, 1, 128]

The Diagnostic Process

We followed a differential diagnosis approach, testing one hypothesis at a time:

#  Hypothesis                          Test                                           Result
1  sr scalar tensor shape wrong        Changed shape: [] to shape: [1]                maxProb 0.002 → 0.27. Better but insufficient.
2  Half model calibrated differently   Swapped to silero_vad_half.onnx (no sr input)  maxProb 0.27, state diverging to [-160, 18]
3  State not updating                  Added state range logging per chunk            State IS updating, but growing unbounded
4  Lower threshold to 0.3              Considered                                     Rejected: masking the problem, not fixing it

Key diagnostic insight: if you need to lower the threshold from the documented 0.5, you have an input problem, not a calibration problem. The state values reaching -160 were the clearest signal that the model was receiving fundamentally malformed input on every frame.

Root Cause 1: Missing 64-Sample Context Prepending (Primary)

This is the big one. Silero VAD v5/v6 requires an input tensor of [1, 576], not [1, 512]. Every inference call needs the last 64 samples from the previous chunk prepended to the current 512-sample audio window.

The C++ reference implementation makes this explicit:

const int context_samples = 64;
window_size_samples = 512;  // user-facing window
effective_window_size = window_size_samples + context_samples;  // 576
input_node_dims[1] = effective_window_size;  // [1, 576]

The Python reference does the same via np.concatenate((self._context, x), axis=1). Context is initialized to 64 zeros on the first call. After each inference, the last 64 samples of the 576-sample input are saved for the next call.
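The same bookkeeping can be sketched in Python with NumPy, mirroring the reference logic (ContextBuffer is an illustrative name, not part of the Silero API):

```python
import numpy as np

CONTEXT = 64   # samples carried over between calls
WINDOW = 512   # user-facing chunk size

class ContextBuffer:
    """Maintains the rolling 64-sample context the model expects."""

    def __init__(self):
        # First call: context is all zeros
        self._context = np.zeros((1, CONTEXT), dtype=np.float32)

    def frame(self, chunk):
        """Turn a [1, 512] chunk into the [1, 576] tensor the model
        actually consumes, then save the last 64 samples for next time."""
        x = np.concatenate((self._context, chunk), axis=1)  # [1, 576]
        self._context = x[:, -CONTEXT:]  # context for the next call
        return x

buf = ContextBuffer()
out = buf.frame(np.ones((1, WINDOW), dtype=np.float32))
print(out.shape)  # (1, 576): 64 zeros of initial context + 512 audio samples
```

Feed `out`, not the raw 512-sample chunk, to the session on every call.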

When you feed only 512 samples, the model's internal STFT (filter_length=256, hop_length=128) receives misaligned data. The model interprets the first 64 samples as context continuity from the previous frame — but you're feeding raw audio there instead. This causes exponential state divergence and suppressed probabilities.

This is not documented in the ONNX model metadata. The input dimension is dynamic ([None, None]), so ONNX Runtime happily accepts 512 samples without error. You only find the 576-sample requirement by reading the C++ source code in the examples directory.

Root Cause 2: sr Tensor Shape (Secondary)

The sample rate tensor needs to be shape [1], not a true scalar []. The C++ reference creates it as a 1D tensor:

const int64_t sr_node_dims[1] = { 1 };  // shape [1], NOT scalar []
Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
    memory_info, sr.data(), sr.size(), sr_node_dims, 1);

The Python reference uses np.array(sr, dtype='int64') which creates a 0-dimensional array. ONNX Runtime doesn't validate input rank (confirmed in GitHub issue #19434), so a model expecting a scalar silently accepts shape [1] and vice versa. But the model's internal branching logic (the full silero_vad.onnx contains if-statements to handle both 8kHz and 16kHz) may pick the wrong path with a mismatched rank.
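The rank difference is easy to demonstrate in NumPy:

```python
import numpy as np

sr_scalar = np.array(16000, dtype='int64')   # Python reference: rank-0 scalar
sr_rank1 = np.array([16000], dtype='int64')  # C++ reference: rank-1, shape [1]

print(sr_scalar.ndim, sr_scalar.shape)  # 0 ()
print(sr_rank1.ndim, sr_rank1.shape)    # 1 (1,)
```

Both hold the same single int64, but they are different tensors at the protocol level, and the runtime will pass either through without complaint.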

In onnxruntime-objc, creating a true scalar requires passing an empty shape array. Whether shape: [] actually produces a rank-0 tensor or something else in the Objective-C bridge is unclear. Using shape: [1 as NSNumber] is safe and matches the C++ reference.

Root Cause 3: State Deep-Copy Semantics (Tertiary)

NSMutableData has reference semantics. If you create the input state tensor's NSMutableData from the output tensor's data without an explicit deep copy, the state may be corrupted between inference calls. The C++ reference uses explicit memcpy:

float* stateN = ort_outputs[1].GetTensorMutableData<float>();
std::memcpy(_state.data(), stateN, size_state * sizeof(float));

In Swift, the correct pattern is:

let stateNData = try stateNTensor.tensorData()
let bytes = stateNData as Data
bytes.withUnsafeBytes { ptr in
    let floatPtr = ptr.bindMemory(to: Float.self)
    for i in 0..<state.count {
        state[i] = floatPtr[i]
    }
}
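The pitfall is the same one NumPy views exhibit; this sketch shows why an explicit copy matters (the arrays stand in for the ONNX state buffers):

```python
import numpy as np

state_out = np.zeros(256, dtype=np.float32)  # stand-in for the runtime's output buffer

alias = state_out          # reference semantics: shares the same memory
copied = state_out.copy()  # deep copy: independent memory

state_out[0] = -160.0      # runtime overwrites its buffer on the next run
print(alias[0])   # -160.0: the "saved" state was silently corrupted
print(copied[0])  # 0.0: the copied state is safe
```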

The Fix

All three issues are addressed in one implementation. The key changes from a naive 512-sample version:

import Foundation
import onnxruntime_objc

class SileroVAD {

    private let chunkSize = 512
    private let contextSize = 64
    private let effectiveSize = 576  // 512 + 64
    private let sampleRate: Int64 = 16000
    private let stateSize = 128

    private var session: ORTSession?
    private var env: ORTEnv?
    private var state: [Float]     // [2, 1, 128] = 256 floats
    private var context: [Float]   // rolling 64-sample context

    init() {
        state = [Float](repeating: 0, count: 2 * 1 * stateSize)
        context = [Float](repeating: 0, count: contextSize)
        // ORTEnv/ORTSession creation (loading silero_vad.onnx) omitted for brevity
    }

    func reset() {
        state = [Float](repeating: 0, count: 2 * 1 * stateSize)
        context = [Float](repeating: 0, count: contextSize)
    }

    func processChunk(_ chunk: [Float]) -> Float {
        guard let session = session else { return 0 }
        guard chunk.count == chunkSize else { return 0 }

        do {
            // FIX 1: Build 576-sample input [context(64) | audio(512)]
            var effectiveInput = [Float](repeating: 0, count: effectiveSize)
            for i in 0..<contextSize {
                effectiveInput[i] = context[i]
            }
            for i in 0..<chunkSize {
                effectiveInput[contextSize + i] = chunk[i]
            }

            // Save last 64 samples as context for next call
            for i in 0..<contextSize {
                context[i] = effectiveInput[effectiveSize - contextSize + i]
            }

            // Input tensor: [1, 576] — NOT [1, 512]
            let inputData = Data(bytes: effectiveInput,
                                 count: effectiveSize * MemoryLayout<Float>.size)
            let inputTensor = try ORTValue(
                tensorData: NSMutableData(data: inputData),
                elementType: .float,
                shape: [1, NSNumber(value: effectiveSize)]
            )

            // State tensor: [2, 1, 128]
            let stateData = NSMutableData(bytes: state,
                                          length: state.count * MemoryLayout<Float>.size)
            let stateTensor = try ORTValue(
                tensorData: stateData,
                elementType: .float,
                shape: [2, 1, NSNumber(value: stateSize)]
            )

            // FIX 2: sr as shape [1], NOT scalar []
            var srValue = sampleRate
            let srData = NSMutableData(bytes: &srValue,
                                       length: MemoryLayout<Int64>.size)
            let srTensor = try ORTValue(
                tensorData: srData,
                elementType: .int64,
                shape: [1 as NSNumber]
            )

            let outputs = try session.run(
                withInputs: ["input": inputTensor,
                             "state": stateTensor,
                             "sr": srTensor],
                outputNames: Set(["output", "stateN"]),
                runOptions: nil
            )

            // Read probability
            var prob: Float = 0
            if let out = outputs["output"] {
                let d = try out.tensorData()
                prob = (d as Data).withUnsafeBytes { $0.load(as: Float.self) }
            }

            // FIX 3: Deep-copy state output
            if let stateN = outputs["stateN"] {
                let d = try stateN.tensorData()
                (d as Data).withUnsafeBytes { ptr in
                    let fp = ptr.bindMemory(to: Float.self)
                    let count = min(state.count,
                                    d.count / MemoryLayout<Float>.size)
                    for i in 0..<count { state[i] = fp[i] }
                }
            }

            return prob
        } catch {
            return 0
        }
    }
}

Results

Configuration                               maxProb on clear speech
512-sample input, shape [], no deep-copy    0.002
512-sample input, shape [1], no deep-copy   0.27
silero_vad_half.onnx (no sr), 512-sample    0.42 (state diverges to [-160, 18])
576-sample input, shape [1], deep-copy      0.9999

Why Nobody Found This

We surveyed the known iOS implementations of Silero VAD. None of them use onnxruntime-objc.

The consistent avoidance of onnxruntime-objc suggests real friction with its tensor creation API, particularly for scalar tensors and models with internal branching logic. We proved it can work — but it requires getting three things right simultaneously.

The Takeaway

If your Silero VAD v5 or v6 model produces low probabilities (<0.5) on clear speech, check these three things in order:

  1. Input size must be 576, not 512. Prepend 64 context samples (zeros on first call, last 64 from previous chunk thereafter).
  2. sr tensor shape must be [1], not []. Match the C++ reference, not the Python reference.
  3. Deep-copy the state output. Don't rely on the output tensor's memory persisting between calls.
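Under those three rules, the per-chunk flow condenses to a short routine. Here is a Python sketch with a stubbed session call (run_inference is a hypothetical placeholder for the actual onnxruntime invocation, not a real API):

```python
import numpy as np

CONTEXT, CHUNK = 64, 512

def run_inference(x, state, sr):
    # Hypothetical stub: a real implementation would feed 'input',
    # 'state', and 'sr' to an onnxruntime session and read back
    # 'output' and 'stateN'.
    return 0.5, state

class VAD:
    def __init__(self):
        self.context = np.zeros((1, CONTEXT), dtype=np.float32)
        self.state = np.zeros((2, 1, 128), dtype=np.float32)
        self.sr = np.array([16000], dtype=np.int64)  # rule 2: shape [1], not []

    def process(self, chunk):
        x = np.concatenate((self.context, chunk), axis=1)  # rule 1: [1, 576]
        self.context = x[:, -CONTEXT:].copy()
        prob, state_n = run_inference(x, self.state, self.sr)
        self.state = np.array(state_n, copy=True)          # rule 3: deep copy
        return float(prob)

vad = VAD()
prob = vad.process(np.zeros((1, CHUNK), dtype=np.float32))
```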

And if you're considering lowering the detection threshold to compensate for low probabilities — don't. A healthy Silero VAD model produces probabilities above 0.95 on clear speech. If you're not seeing that, you have an input problem.