Happy SIMD in Rust (without getting a stroke)
by Daniel Boros
Feb 21
Dear Rust / low-level fans — today’s post is a “how to SIMD in Rust” style brain dump.
If you haven’t met the term SIMD yet: it stands for Single Instruction, Multiple Data. The idea is that the CPU can run the same operation over multiple values at once by using vector registers. A register is a tiny, ultra-fast storage slot inside the CPU (much faster than RAM, faster than cache too) that holds operands for instructions. SIMD uses wider registers (and instructions that operate on them) to process multiple “lanes” in parallel.
SIMD isn’t one single thing — it shows up as architecture-specific instruction set extensions like NEON (ARM), SSE/AVX/AVX2/AVX-512 (x86), and friends. If you open the docs for these intrinsics the first time, you might briefly see your life flash before your eyes… but in the end it’s usually less unreadable than it looks (speaking from personal experience).
SIMD is powerful, but not a universal hammer. Some code benefits a lot, some barely moves, and sometimes you lose time due to overhead, branching, or memory bottlenecks. Still, it’s worth having in your toolbox.
Three ways I write SIMD in Rust
In practice, I usually SIMD in Rust in three “levels”:
- The high-level crate approach via wide
- The (currently unstable) standard-library approach via std::simd / portable SIMD
- The "hardcore" approach via core::arch intrinsics (fully architecture-dependent, typically unsafe)
The first two mostly save you from writing three times the code for three different architectures. The third one gives you full control — and full responsibility.
Let’s look at the “veterinarian’s horse” example (the canonical show-everything demo): adding two vectors.
The simplest example: adding two 8-lane vectors
We’ll use f64 in the code snippets. Later I’ll also show what went wrong in the u64 NEON case.
const A: [f64; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
const B: [f64; 8] = [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0];
fn no_simd() {
let mut out = [0_f64; 8];
for i in 0..8 {
out[i] = A[i] + B[i];
}
}
That’s our baseline.
Using wide
fn wide_simd() {
let a = wide::f64x8::from(A);
let b = wide::f64x8::from(B);
let _ = a + b;
}
Using std::simd (portable SIMD; currently unstable)
fn portable_simd() {
let a = std::simd::f64x8::from_array(A);
let b = std::simd::f64x8::from_array(B);
let _ = a + b;
}
Using architecture intrinsics (ARM NEON)
unsafe fn simd_neon_f64() {
let mut out = [0_f64; 8];
for i in (0..8).step_by(2) {
let a = std::arch::aarch64::vld1q_f64(A.as_ptr().add(i));
let b = std::arch::aarch64::vld1q_f64(B.as_ptr().add(i));
let sum = std::arch::aarch64::vaddq_f64(a, b);
std::arch::aarch64::vst1q_f64(out.as_mut_ptr().add(i), sum);
}
}
That step_by(2) is the key tell: ARM NEON is 128-bit wide, and f64 is 64-bit, so you get 2 lanes per vector register. In other words, on NEON you naturally operate on f64x2. When you write f64x8 using wide or std::simd, the compiler typically lowers that to multiple 128-bit operations “under the hood”.
On x86 with AVX2 (256-bit) you’d get 4 lanes of f64, and with AVX-512 you’d get 8 lanes — meaning the same “shape” of code can scale to wider machines.
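If you want to make that scaling explicit, portable SIMD lets you write the kernel once and pick the lane count per target. Here's a minimal sketch of that idea (nightly only; the generic chunking and the add_slices helper are mine, not taken from the snippets above, and a real kernel would also handle the remainder):
// Nightly only: requires #![feature(portable_simd)] at the crate root.
use std::simd::{LaneCount, Simd, SupportedLaneCount};
fn add_slices<const LANES: usize>(a: &[f64], b: &[f64], out: &mut [f64])
where
    LaneCount<LANES>: SupportedLaneCount,
{
    // Process full vectors only; leftover elements would need a scalar tail loop.
    for ((a, b), out) in a
        .chunks_exact(LANES)
        .zip(b.chunks_exact(LANES))
        .zip(out.chunks_exact_mut(LANES))
    {
        let va = Simd::<f64, LANES>::from_slice(a);
        let vb = Simd::<f64, LANES>::from_slice(b);
        out.copy_from_slice((va + vb).as_array());
    }
}
// add_slices::<2> matches NEON's native width, ::<4> matches AVX2,
// ::<8> matches AVX-512: same code, different lane count.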
Results (Apple M4 / ARM64)
These are quick timings from my machine, reported as-is. The usual microbenchmark caveat applies: for tiny loops like this, measurement noise, inlining, and optimizer effects can dominate. Still, it’s a nice sanity check.
| Method | f64 output | f64 time | u64 output | u64 time |
|---|---|---|---|---|
| no_simd | [10, 12, 14, 16, 18, 20, 22, 24] | 0.004583 | [10, 12, 14, 16, 18, 20, 22, 24] | 0.002792 |
| wide | [10, 12, 14, 16, 18, 20, 22, 24] | 0.00425 | [10, 12, 14, 16, 18, 20, 22, 24] | 0.002583 |
| portable_simd | [10, 12, 14, 16, 18, 20, 22, 24] | 0.00375 | [10, 12, 14, 16, 18, 20, 22, 24] | 0.002542 |
| simd_neon | [10, 12, 14, 16, 18, 20, 22, 24] | 0.004292 | [8, 0, 8, 0, 8, 0, 24, 0] | 0.002958 |
The interesting part isn’t that “SIMD didn’t win hard” here — that’s totally expected on an 8-element toy loop where the compiler can do clever things anyway. The interesting part is the u64 NEON result being wrong.
That output strongly suggests the integer NEON version was implemented with the wrong intrinsics (for example, mixing float ops with integer data, reinterpreting types incorrectly, or storing with the wrong lane width). For u64 on NEON you want the u64 intrinsics, like this:
unsafe fn simd_neon_u64(a: &[u64; 8], b: &[u64; 8]) -> [u64; 8] {
let mut out = [0u64; 8];
for i in (0..8).step_by(2) {
let va = std::arch::aarch64::vld1q_u64(a.as_ptr().add(i));
let vb = std::arch::aarch64::vld1q_u64(b.as_ptr().add(i));
let vs = std::arch::aarch64::vaddq_u64(va, vb);
std::arch::aarch64::vst1q_u64(out.as_mut_ptr().add(i), vs);
}
out
}
So: one of the “hidden costs” of going full core::arch is that it’s very easy to write something that compiles and still produces garbage if you pick the wrong intrinsic or interpret the lanes incorrectly.
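One cheap habit that catches this class of bug early: keep a boring scalar reference implementation around and assert against it in a test. A minimal sketch (the scalar helper and the test are mine, not part of the benchmark code):
fn scalar_add_u64(a: &[u64; 8], b: &[u64; 8]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]); // vaddq_u64 wraps, so match that
    }
    out
}
#[cfg(target_arch = "aarch64")]
#[test]
fn neon_u64_matches_scalar() {
    let a = [1u64, 2, 3, 4, 5, 6, 7, 8];
    let b = [9u64, 10, 11, 12, 13, 14, 15, 16];
    let simd = unsafe { simd_neon_u64(&a, &b) };
    assert_eq!(simd, scalar_add_u64(&a, &b));
}
A test like this would have flagged the [8, 0, 8, 0, …] output long before it showed up in a benchmark table.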
So when is SIMD worth it?
A lot of scalar integer and floating-point operations are already highly optimized, and LLVM can auto-vectorize some patterns. So if you’re reaching for SIMD just to “SIMD all the things”, you may only buy yourself pain: new API surface, more unsafe, more thinking about memory layout, and more time spent benchmarking.
But when you hit the right workload — large arrays, predictable access patterns, math-heavy kernels, tight loops that dominate runtime — SIMD can be a serious win. And in real projects you often care less about “this 8-element add” and more about “this kernel runs 10 million times”.
In stochastic-rs, where we’re moving more and more toward SIMD-accelerated algorithms, we currently build heavily on wide, but I’m not married to it. If/when std::simd becomes stable, it can become an attractive drop-in option (especially to reduce dependencies) — but the only honest way to choose is: benchmark, benchmark, benchmark.
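On the "benchmark" part: one easy trap with toy loops like the ones above is that the optimizer may simply delete work whose result is never used. A crude sketch of how I'd guard against that with std::hint::black_box (for real measurements, a proper harness like criterion is the better tool; this helper is only illustrative):
use std::hint::black_box;
use std::time::Instant;
fn bench_scalar_add(iterations: u32) {
    let a = black_box(A); // keep the compiler from const-folding the inputs
    let b = black_box(B);
    let start = Instant::now();
    for _ in 0..iterations {
        let mut out = [0.0f64; 8];
        for i in 0..8 {
            out[i] = a[i] + b[i];
        }
        black_box(&out); // and from discarding the result
    }
    println!("scalar add: {:?}", start.elapsed());
}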
A more interesting example: Ziggurat normal sampling (SIMD-ish)
Now for something that actually smells like a SIMD target: a simplified “SIMD-ish” version of a Ziggurat standard normal generator (in the spirit of what you’d do in stochastic-rs).
The Ziggurat method works by slicing the target density (here, the standard normal) into many horizontal layers (often 128). Each layer forms a rectangle under the curve. Sampling tries to land inside a rectangle using mostly integer operations and table lookups. Most samples are accepted quickly via a fast integer comparison. Rarely, you hit a “wedge” region or the tail, and then you do the slower math (ln, exp) to fix things up.
That split is exactly where SIMD can help: the fast accept test is cheap and uniform, while the slow path is branchy and expensive.
1) Precomputed tables (one-time initialization)
This builds the classic Ziggurat tables: kn for fast accept thresholds, wn for scaling, and fn_tab for acceptance in the wedge case. The values are computed once and cached.
use std::sync::OnceLock;
/// Precomputed Ziggurat tables (128 layers).
struct ZigTables {
kn: [i32; 128],
wn: [f64; 128],
fn_tab: [f64; 128],
}
static ZIG_TABLES: OnceLock<ZigTables> = OnceLock::new();
fn zig_tables() -> &'static ZigTables {
ZIG_TABLES.get_or_init(|| {
let mut kn = [0i32; 128];
let mut wn = [0.0f64; 128];
let mut fn_tab = [0.0f64; 128];
let mut dn = 3.442619855899f64;
let vn = 9.91256303526217e-3f64;
let m1 = 2147483648.0f64;
let q = vn / (-0.5 * dn * dn).exp();
let kn0 = (dn / q) * m1;
kn[0] = if kn0 > i32::MAX as f64 { i32::MAX } else { kn0 as i32 };
kn[1] = 0;
wn[0] = q / m1;
wn[127] = dn / m1;
fn_tab[0] = 1.0;
fn_tab[127] = (-0.5 * dn * dn).exp();
let mut tn = dn;
for i in (1..=126).rev() {
dn = (-2.0 * (vn / dn + (-0.5 * dn * dn).exp()).ln()).sqrt();
let kn_val = (dn / tn) * m1;
kn[i + 1] = if kn_val > i32::MAX as f64 { i32::MAX } else { kn_val as i32 };
tn = dn;
fn_tab[i] = (-0.5 * dn * dn).exp();
wn[i] = dn / m1;
}
ZigTables { kn, wn, fn_tab }
})
}
This table generation isn’t the hot path; it’s setup. If you want to go even harder, you can generate these at build time, but for most use cases OnceLock is clean and fast enough.
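As a quick sanity check on that setup cost: repeated calls all hand back the same &'static tables, and only the first call pays for the computation. A tiny sketch of what that looks like in practice:
fn main() {
    let t1 = zig_tables(); // first call builds the tables
    let t2 = zig_tables(); // later calls are essentially an atomic load
    assert!(std::ptr::eq(t1, t2)); // same &'static ZigTables both times
    println!("kn[0] = {}, wn[127] = {:e}", t1.kn[0], t1.wn[127]);
}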
2) The rare-case fixup (nfix): tails + wedge acceptance
Most of the time, Ziggurat sampling avoids expensive math. But when the fast accept test fails, you need a “fixup” that handles the tricky parts: tail sampling and wedge acceptance.
use rand::Rng;
#[cold]
#[inline(never)]
fn nfix<R: Rng + ?Sized>(mut hz: i32, mut iz: usize, tables: &ZigTables, rng: &mut R) -> f64 {
const R_TAIL: f64 = 3.442620;
loop {
let x = hz as f64 * tables.wn[iz];
// Tail case
if iz == 0 {
loop {
let u1: f64 = rng.random(); // random in [0,1)
let u2: f64 = rng.random();
let x_tail = -u1.ln() * 0.2904764; // 0.2904764 ≈ 1 / R_TAIL; must be positive
let y = -u2.ln();
if y + y >= x_tail * x_tail {
return if hz > 0 { R_TAIL + x_tail } else { -R_TAIL - x_tail };
}
}
}
// Wedge / acceptance test
let u: f64 = rng.random();
if tables.fn_tab[iz] + u * (tables.fn_tab[iz - 1] - tables.fn_tab[iz])
< (-0.5 * x * x).exp()
{
return x;
}
// Resample: hz is just fresh random bits
hz = rng.random::<i32>();
iz = (hz & 127) as usize;
if (hz.unsigned_abs() as i64) < tables.kn[iz] as i64 {
return hz as f64 * tables.wn[iz];
}
}
}
Any place you see rng.random::<...>(), that’s simply “give me more randomness” — the algorithm uses those random bits either as an integer hz (for fast acceptance) or as uniforms u (for wedge/tail logic).
3) Scalar standard normal: fast accept first, fixup rarely
fn sample_standard<R: Rng + ?Sized>(rng: &mut R) -> f64 {
let tables = zig_tables();
let hz: i32 = rng.random(); // raw random bits
let iz: usize = (hz & 127) as usize; // layer index in [0,127]
// Fast accept: pure integer compare, no exp/ln here.
if (hz.unsigned_abs() as i64) < tables.kn[iz] as i64 {
hz as f64 * tables.wn[iz]
} else {
nfix(hz, iz, tables, rng)
}
}
This is the heart of why Ziggurat is fast: most samples finish in that integer compare + one multiply.
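Wiring it up looks like this (assuming rand 0.9, where thread_rng() became rand::rng() and gen() became random(), matching the calls above):
fn main() {
    let mut rng = rand::rng();
    let samples: Vec<f64> = (0..5).map(|_| sample_standard(&mut rng)).collect();
    println!("{samples:?}"); // five values, typically somewhere around ±3
}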
4) “SIMD-ish” fill: vectorize the accept test (8 lanes)
Now the fun part: filling an output slice. Each normal sample is independent, which makes this embarrassingly parallel. The catch is rejection: different lanes may accept or reject, so you want a fast “all accepted” path, and a fallback that fixes up only the rejected lanes.
This version uses wide’s i32x8 to vectorize the accept test. The actual “random generation” here is still scalar (we call rng.random() eight times), but the comparison logic and branching are shaped to benefit from SIMD.
use rand_distr::Distribution;
use wide::{i32x8, CmpLt};
#[derive(Clone, Copy, Debug)]
pub struct ZigguratNormal {
mean: f64,
std_dev: f64,
}
impl ZigguratNormal {
pub fn new(mean: f64, std_dev: f64) -> Self {
assert!(std_dev > 0.0);
let _ = zig_tables(); // init tables once
Self { mean, std_dev }
}
#[inline]
pub fn sample_with<R: rand::Rng + ?Sized>(&self, rng: &mut R) -> f64 {
self.mean + self.std_dev * sample_standard(rng)
}
pub fn fill_slice<R: rand::Rng + ?Sized>(&self, rng: &mut R, out: &mut [f64]) {
for x in out.iter_mut() {
*x = self.sample_with(rng);
}
}
/// “SIMD-ish” filler: vectorized accept test (8 lanes) using wide::i32x8,
/// with scalar fixup per rejected lane.
pub fn fill_slice_simd<R: rand::Rng + ?Sized>(&self, rng: &mut R, out: &mut [f64]) {
let tables = zig_tables();
let mask127 = i32x8::splat(127);
let mut i = 0usize;
while i < out.len() {
// hz is just 8 fresh i32 random values packed into a vector
let hz = next_i32x8(rng);
let iz = hz & mask127;
let hz_arr = hz.to_array();
let iz_arr = iz.to_array();
// Gather kn[iz] values (loads are scalar, but the code stays compact)
let kn_vals = i32x8::new([
tables.kn[iz_arr[0] as usize],
tables.kn[iz_arr[1] as usize],
tables.kn[iz_arr[2] as usize],
tables.kn[iz_arr[3] as usize],
tables.kn[iz_arr[4] as usize],
tables.kn[iz_arr[5] as usize],
tables.kn[iz_arr[6] as usize],
tables.kn[iz_arr[7] as usize],
]);
let abs_hz = hz.abs(); // i32::MIN edge case is astronomically rare
let accept = abs_hz.cmp_lt(kn_vals); // lanewise |hz| < kn[iz] via wide's CmpLt
if accept.all() {
// Fast path: no per-lane branches, everyone accepted
for lane in 0..8 {
if i >= out.len() { break; }
let idx = iz_arr[lane] as usize;
let x = hz_arr[lane] as f64 * tables.wn[idx];
out[i] = self.mean + self.std_dev * x;
i += 1;
}
} else {
// Slow path: only rejected lanes pay the expensive fixup
let accept_arr = accept.to_array();
for lane in 0..8 {
if i >= out.len() { break; }
let idx = iz_arr[lane] as usize;
let hz_lane = hz_arr[lane];
let z = if accept_arr[lane] != 0 {
hz_lane as f64 * tables.wn[idx]
} else {
nfix(hz_lane, idx, tables, rng)
};
out[i] = self.mean + self.std_dev * z;
i += 1;
}
}
}
}
}
impl Distribution<f64> for ZigguratNormal {
fn sample<R: rand::Rng + ?Sized>(&self, rng: &mut R) -> f64 {
self.sample_with(rng)
}
}
fn next_i32x8<R: rand::Rng + ?Sized>(rng: &mut R) -> i32x8 {
i32x8::new([
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
rng.random::<i32>(),
])
}
What’s SIMD-friendly here is the “bulk decision”: the acceptance test is an integer vector compare across 8 lanes. When the mask says “all accepted”, you avoid per-lane branching and you never call the expensive nfix. When not all accept, you still win because you only pay the heavy stuff for the lanes that need it.
Also notice what’s not SIMD yet: the RNG fetch (rng.random()) and the table gather (kn[iz]) are scalar in this minimal version. If you want to push further, you typically move to a RNG that can generate a vector of random bits efficiently (e.g. a SIMD-friendly xoshiro/xoroshiro variant), and you design your tables/memory layout to make gathers cheaper — or you accept that the fast path dominates anyway.
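One small, safe step in that direction is to at least batch the RNG work: ask for 32 bytes in a single call instead of eight separate random::<i32>() calls. A sketch (this helper is mine; a genuinely SIMD RNG would go further and produce the lanes in vector registers):
fn next_i32x8_bulk<R: rand::Rng + ?Sized>(rng: &mut R) -> wide::i32x8 {
    let mut bytes = [0u8; 32];
    rng.fill_bytes(&mut bytes); // one RNG call for all 8 lanes
    let mut lanes = [0i32; 8];
    for (lane, chunk) in lanes.iter_mut().zip(bytes.chunks_exact(4)) {
        *lane = i32::from_le_bytes(chunk.try_into().unwrap());
    }
    wide::i32x8::new(lanes)
}
Whether that actually wins depends on the generator; for a very cheap RNG the per-call overhead may already be negligible, so as always: measure.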
Wrapping up
SIMD is one of those tools that can feel intimidating at first, but Rust makes the on-ramp much nicer than it used to be. Starting with wide or the (eventually stable) std::simd model helps you learn the mental model—lanes, masks, fast vs slow paths—without immediately living in the world of architecture intrinsics and unsafe.
And if you ever do need to go full metal, core::arch is there… just bring benchmarks and a healthy fear of “it compiles” not meaning “it’s correct”.
Happy SIMD in Rust ❤️🦀