stochastic-rs v1 stable
by Daniel Boros
Feb 14
8 min read
SIMD Random Number Generation
The library now ships its own SIMD-accelerated PRNG in simd_rng::SimdRng. It runs two independent engines:
- Xoshiro256++ ×4 for f64 — operates on `u64x4` (from the `wide` crate), produces 4 `u64` values per state advance, converts to `f64` by right-shifting 11 bits and scaling by `1.0 / (1u64 << 53)`
- Xoshiro128++ ×8 for f32 — operates on `u32x8`, produces 8 `u32` values per state advance, converts to `f32` by right-shifting 8 bits and scaling by `1.0 / (1u32 << 24)`
Both are seeded from `rand::rng()` at construction. `SimdRng` implements `RngCore`, so it can be used as a drop-in replacement for any standard RNG, but the primary interface is `next_f64_array() -> [f64; 8]` and `next_f32_array() -> [f32; 8]` for batch consumption. There's also `next_i32x8() -> i32x8`, used directly by the Ziggurat samplers to avoid the float conversion step entirely.
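The shift-and-scale conversions described above can be sketched in scalar form; `SimdRng` applies the same arithmetic lane-wise on the wide types:

```rust
// Scalar sketch of the integer-to-float conversions; the library applies
// the same arithmetic per lane on u64x4 / u32x8.
fn u64_to_f64(x: u64) -> f64 {
    // keep the top 53 bits, scale into [0, 1)
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

fn u32_to_f32(x: u32) -> f32 {
    // keep the top 24 bits, scale into [0, 1)
    (x >> 8) as f32 * (1.0 / (1u32 << 24) as f32)
}
```

Keeping only as many bits as the mantissa holds (53 for `f64`, 24 for `f32`) guarantees every output is exactly representable and strictly below 1.0.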
SIMD Ziggurat Algorithm (Normal, Exp)
`SimdNormal<T: SimdFloatExt, const N: usize>` implements the Ziggurat method with a SIMD fast path. The algorithm per iteration:
- Draw 8 random `i32` values via `SimdRng::next_i32x8()`
- Compute `iz = hz & 127` (table index, 128 rectangles)
- Gather 8 `kn[iz]` threshold values into an `i32x8`
- SIMD compare: `abs(hz) < kn[iz]` → produces an 8-lane mask
- If all 8 accepted (`accept.all()`): compute `result = hz_float * wn[iz]` in SIMD, apply mean/stddev scaling via `mean_simd + std_dev_simd * result`, and write 8 values to the output buffer at once
- If any rejected: fall through to a scalar loop — accepted lanes get the SIMD result, rejected lanes call `nfix()`, which handles the tail (Marsaglia's tail algorithm for `iz == 0`, rejection sampling for rectangle edges)
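The `iz == 0` branch that `nfix()` handles is Marsaglia's tail algorithm; a scalar sketch, with `uniform` as a stand-in for the library's RNG (it must return values in (0, 1)) and `r` the rightmost rectangle edge:

```rust
// Marsaglia's tail algorithm for the normal distribution: samples from the
// region beyond the rightmost Ziggurat rectangle edge r. `uniform` stands
// in for the library's RNG and must return values in (0, 1).
fn normal_tail(r: f64, uniform: &mut impl FnMut() -> f64) -> f64 {
    loop {
        let x = -uniform().ln() / r; // exponential draw with rate r
        let y = -uniform().ln();     // exponential draw with rate 1
        if y + y >= x * x {
            return r + x; // accepted: a normal sample from the tail beyond r
        }
    }
}
```

For the 128-rectangle normal table, `r` is approximately 3.4426; every returned value lies at or beyond that edge.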
The Ziggurat tables (kn, wn, fn_tab — 128 entries each) are computed once via OnceLock and shared across all instances.
There are two codepaths: fill_ziggurat() applies mean+stddev scaling, fill_ziggurat_standard() skips it (outputs N(0,1) directly). The standard variant is what FGN uses internally.
The const generic N controls the internal buffer size. When calling distribution.sample(&mut rng) (single value), it actually refills the entire N-element buffer via the SIMD path and returns values from it one by one. fill_slice() and fill_standard_fast() bypass the buffer entirely and write straight into the caller's slice.
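The buffer-and-refill behavior of `sample()` described above can be sketched with plain Rust (illustrative names; a loop stands in for the SIMD fill path):

```rust
// Illustrative buffer-and-refill pattern: sample() hands out one value at a
// time from an N-element buffer and bulk-refills it when exhausted.
struct BufferedSampler<const N: usize> {
    buf: [f64; N],
    idx: usize, // next unread position; N means "empty, refill needed"
}

impl<const N: usize> BufferedSampler<N> {
    fn new() -> Self {
        Self { buf: [0.0; N], idx: N }
    }

    fn refill(&mut self) {
        // stand-in for the bulk SIMD Ziggurat fill
        for (i, v) in self.buf.iter_mut().enumerate() {
            *v = i as f64;
        }
        self.idx = 0;
    }

    fn sample(&mut self) -> f64 {
        if self.idx == N {
            self.refill();
        }
        let v = self.buf[self.idx];
        self.idx += 1;
        v
    }
}
```

The point of the pattern is that even single-value consumers amortize the SIMD fill cost across N draws; slice-filling callers skip the buffer entirely.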
SimdExpZig follows the same pattern but with 256 rectangles (iz = hz & 0xFF) and exponential tail handling (efix()).
Other Distributions
The remaining distributions build on top of SimdNormal, SimdExpZig, and SimdRng through different strategies:
Direct SIMD transform — distributions that are a closed-form function of uniform samples. These process 8 values per iteration through SIMD math:
- `SimdCauchy`: generates 8 uniforms via `fill_uniform_simd`, computes `x0 + gamma * tan(π * (u - 0.5))` in SIMD using `simd_tan`
- `SimdPareto`: `scale * u^(-1/alpha)` via `simd_powf`
- `SimdLogNormal`: draws from `SimdNormal`, applies `simd_exp`
- `SimdWeibull`, `SimdUniform`: similar transform patterns
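Scalar forms of two of these transforms, for reference (the library evaluates them 8 lanes at a time with `simd_tan` / `simd_powf`):

```rust
// Cauchy via the inverse CDF: x0 + gamma * tan(pi * (u - 0.5)), u in (0, 1).
fn cauchy_transform(x0: f64, gamma: f64, u: f64) -> f64 {
    x0 + gamma * (std::f64::consts::PI * (u - 0.5)).tan()
}

// Pareto via inverse transform sampling: scale * u^(-1/alpha), u in (0, 1).
fn pareto_transform(scale: f64, alpha: f64, u: f64) -> f64 {
    scale * u.powf(-1.0 / alpha)
}
```

Because each output is a pure function of one uniform, these distributions vectorize with no accept/reject branches at all.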
Rejection sampling with SIMD base — distributions that need accept/reject loops but use SIMD-accelerated normals/uniforms internally:
- `SimdGamma`: Marsaglia & Tsang's method — draws normal samples via the internal `SimdNormal`, uniform checks via `SimdRng`. For `alpha < 1`, uses the `alpha + 1` trick with a power-of-uniform correction
- `SimdBeta`: generates two independent Gamma samples, computes `x / (x + y)`
- `SimdChiSquared`: delegates to `SimdGamma(k/2, 2.0)`
- `SimdStudentT`: `normal / sqrt(chi_squared / k)`
- `SimdPoisson`: Knuth's algorithm with `SimdExpZig`-accelerated exponential draws
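A hedged scalar sketch of Marsaglia & Tsang's gamma sampler for `alpha >= 1`, the method named above; `normal` and `uniform` stand in for the library's SIMD sources, and 0.0331 is the squeeze constant the post mentions later:

```rust
// Marsaglia & Tsang (2000): gamma(alpha) for alpha >= 1 via a transformed
// normal draw plus a squeeze test, then a full log-based acceptance test.
fn gamma_mt(
    alpha: f64,
    normal: &mut impl FnMut() -> f64,
    uniform: &mut impl FnMut() -> f64,
) -> f64 {
    let d = alpha - 1.0 / 3.0;
    let c = 1.0 / (9.0 * d).sqrt();
    loop {
        let x = normal();
        let v = 1.0 + c * x;
        if v <= 0.0 {
            continue; // cube would be non-positive: reject immediately
        }
        let v = v * v * v;
        let u = uniform();
        if u < 1.0 - 0.0331 * x * x * x * x {
            return d * v; // cheap squeeze accept, avoids the logs
        }
        if u.ln() < 0.5 * x * x + d * (1.0 - v + v.ln()) {
            return d * v; // full acceptance test
        }
    }
}
```

The squeeze accepts the vast majority of proposals without evaluating any logarithm, which is what makes the method pair well with a fast normal source.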
All distributions use a 16-element internal buffer (UnsafeCell<[T; 16]>) and refill via their respective fill_slice_fast() when exhausted. They implement rand_distr::Distribution<T>, so they're compatible with the standard dist.sample(&mut rng) API.
Benchmarks vs rand_distr
Measured with Criterion on Apple M-series, --release, 100K samples:
- Normal f64 N=32: 201µs vs 973µs → 4.84x
- Exp f64 N=32: 188µs vs 924µs → 4.91x
- Beta f64: 1129µs vs 4652µs → 4.12x
- Cauchy f32: 231µs vs 1010µs → 4.37x
- StudentT f64: 784µs vs 2332µs → 2.97x
- Gamma f64: 560µs vs 1490µs → 2.66x
- LogNormal f64: 468µs vs 1284µs → 2.74x
- ChiSquared f64: 545µs vs 1478µs → 2.71x
- Poisson u32: 2166µs vs 4235µs → 1.96x
- Weibull f64: 1025µs vs 1510µs → 1.47x
- Uniform: ~1:1 (already bottlenecked by RNG throughput)
The biggest wins are on Normal/Exp (Ziggurat fast path) and distributions that compose them (Beta = two Gammas, each calling Normal internally). Distributions that are dominated by transcendental functions (Weibull's powf) see smaller gains since the SIMD math functions themselves are the bottleneck.
Generic Precision — <T: FloatExt>
The f32 and f64 feature flags are removed. All models are generic over T: FloatExt. The trait hierarchy:
```rust
pub trait SimdFloatExt: num_traits::Float + Default + Send + Sync + 'static {
    type Simd: Copy + Mul<Output = Self::Simd> + Add<Output = Self::Simd> + ...;
    fn splat(val: Self) -> Self::Simd;
    fn simd_from_array(arr: [Self; 8]) -> Self::Simd;
    fn simd_to_array(v: Self::Simd) -> [Self; 8];
    fn simd_ln(v: Self::Simd) -> Self::Simd;
    fn simd_sqrt(v: Self::Simd) -> Self::Simd;
    fn simd_exp(v: Self::Simd) -> Self::Simd;
    fn fill_uniform_simd(rng: &mut SimdRng, out: &mut [Self]);
    fn simd_from_i32x8(v: i32x8) -> Self::Simd;
    fn from_f64_fast(v: f64) -> Self;
    // ...
}
```
SimdFloatExt is implemented for f32 (where Simd = f32x8) and f64 (where Simd = f64x8). This is what allows a single generic fill_ziggurat<T> to emit SIMD code for both precisions — the trait dispatches to the right wide type at compile time.
FloatExt extends SimdFloatExt with num_traits::FromPrimitive, FloatConst, ScalarOperand, AddAssign, SubAssign, and provides from_usize_() and normal_array(). Every stochastic process is bounded by T: FloatExt.
The from_f64_fast(v: f64) -> Self method is used throughout to convert compile-time constants (like 0.5, 0.0331 in the Gamma sampler) without going through FromPrimitive's Option — for f64 it's identity, for f32 it's a cast.
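The `from_f64_fast` idea can be shown with a minimal stand-alone trait (illustrative, not the library's definition):

```rust
// Identity for f64, a plain cast for f32 — no Option round-trip through
// FromPrimitive for constants that are known to fit.
trait FromF64Fast {
    fn from_f64_fast(v: f64) -> Self;
}

impl FromF64Fast for f64 {
    fn from_f64_fast(v: f64) -> f64 {
        v // identity
    }
}

impl FromF64Fast for f32 {
    fn from_f64_fast(v: f64) -> f32 {
        v as f32 // lossy but infallible cast
    }
}
```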
Trait System
```rust
pub trait ProcessExt<T: FloatExt>: Send + Sync {
    type Output: Send;

    fn sample(&self) -> Self::Output;

    fn sample_par(&self, m: usize) -> Vec<Self::Output> {
        (0..m).into_par_iter().map(|_| self.sample()).collect()
    }

    #[cfg(feature = "cuda")]
    fn sample_cuda(&self, _m: usize) -> Result<Either<Array1<f64>, Array2<f64>>> { ... }
}
```
ProcessExt requires Send + Sync so sample_par can call sample() from rayon worker threads. The Output type is Array1<T> for single-component processes (GBM, OU, FGN, etc.) and [Array1<T>; 2] for two-component processes (Heston returns [price, variance], SABR returns [forward, vol], correlated Gaussian noise returns [gn1, gn2]).
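To illustrate the contract, here is a toy process against a local stand-in for the trait (`Vec<f64>` instead of `Array1<T>`, sequential iteration instead of rayon — names are illustrative):

```rust
// Local stand-in for the ProcessExt contract: a default sample_par built
// on sample(). The real trait uses rayon's into_par_iter() here.
trait ProcessLike: Send + Sync {
    type Output: Send;
    fn sample(&self) -> Self::Output;
    fn sample_par(&self, m: usize) -> Vec<Self::Output> {
        (0..m).map(|_| self.sample()).collect()
    }
}

// Trivial single-component "process": a constant path of length n.
struct ConstPath {
    n: usize,
    value: f64,
}

impl ProcessLike for ConstPath {
    type Output = Vec<f64>;
    fn sample(&self) -> Vec<f64> {
        vec![self.value; self.n]
    }
}
```

A real implementor only writes `sample()`; path-level parallelism comes for free from the default method, which is why the `Send + Sync` bounds sit on the trait itself.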
DistributionExt provides analytical distribution properties for processes that have them. GBM implements it by wrapping statrs::LogNormal — at construction time it computes the log-normal parameters (mu_ln = ln(x0) + (mu - 0.5*sigma^2)*t, sigma_ln = sigma*sqrt(t)) and delegates pdf, cdf, inv_cdf, mean, variance, skewness, entropy to the statrs implementation.
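The log-normal parameters quoted above, as a small sketch (hypothetical helper name, not the library's API):

```rust
// Terminal distribution of GBM: S_t is log-normal with these ln-space
// parameters, per the formulas in the text.
fn gbm_lognormal_params(x0: f64, mu: f64, sigma: f64, t: f64) -> (f64, f64) {
    let mu_ln = x0.ln() + (mu - 0.5 * sigma * sigma) * t;
    let sigma_ln = sigma * t.sqrt();
    (mu_ln, sigma_ln)
}
```

As a sanity check, the log-normal mean `exp(mu_ln + sigma_ln^2 / 2)` recovers the GBM expectation `x0 * exp(mu * t)` — the `-0.5 * sigma^2` drift correction and the log-normal variance term cancel exactly.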
FGN Implementation Details
The Fractional Gaussian Noise generator uses the Davies-Harte / Circulant Embedding method via FFT. Construction (FGN::new):
- Build the autocovariance vector `r[k] = 0.5 * ((k+1)^(2H) - 2*k^(2H) + (k-1)^(2H))` for `k = 0..n`
- Mirror it to form a circulant: `[r[0], ..., r[n], r[n-1], ..., r[1]]` (length `2n`)
- FFT the circulant to get the eigenvalues
- Take `sqrt(eigenvalues / (2n))` — stored as `Arc<Array1<Complex<T>>>` for thread-safe sharing
- Pre-allocate an `Arc<FftHandler<T>>` for the `2n`-point FFT used during sampling
n is rounded up to the next power of two (required by the circulant embedding), with an offset tracking the padding.
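The autocovariance and circulant-mirroring steps above, minus the FFT, in plain Rust (an illustrative sketch, not the library's code):

```rust
// FGN autocovariance r[k] = 0.5 * ((k+1)^(2H) - 2 k^(2H) + |k-1|^(2H)),
// for k = 0..=n. |k-1| handles the k = 0 case.
fn fgn_autocov(h: f64, n: usize) -> Vec<f64> {
    (0..=n)
        .map(|k| {
            let k = k as f64;
            0.5 * ((k + 1.0).powf(2.0 * h) - 2.0 * k.powf(2.0 * h)
                + (k - 1.0).abs().powf(2.0 * h))
        })
        .collect()
}

// Mirror into the first circulant row [r[0], ..., r[n], r[n-1], ..., r[1]].
fn circulant_row(r: &[f64]) -> Vec<f64> {
    let n = r.len() - 1;
    let mut row = r.to_vec();
    for k in (1..n).rev() {
        row.push(r[k]);
    }
    row // length 2n
}
```

For `H = 0.5` the increments are independent, so every lag beyond 0 gives zero autocovariance — a quick way to sanity-check the formula.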
Sampling (FGN::sample):
- Allocate an `Array1<Complex<T>>` of length `2n`
- Cast the complex array to `&mut [T]` via `from_raw_parts_mut` — this works because `Complex<T>` is `repr(C)` with layout `{re: T, im: T}`, giving a flat slice of `4n` real values
- Fill the entire flat slice with `SimdNormal::fill_standard_fast` — one bulk SIMD Ziggurat pass, no per-element or per-pair overhead
- Element-wise multiply by the pre-computed sqrt eigenvalues
- FFT the result
- Extract the real parts from indices `[1..n-offset+1]` and scale by `n^(-H) * t^H`
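The complex-to-flat reinterpretation used during sampling can be sketched as follows; `Complex64` here is an illustrative stand-in for `num_complex::Complex<f64>`:

```rust
// A #[repr(C)] complex type with two f64 fields can be viewed as a flat
// &mut [f64] of twice the length, letting one bulk fill write both the
// real and imaginary parts.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct Complex64 {
    re: f64,
    im: f64,
}

fn as_flat(buf: &mut [Complex64]) -> &mut [f64] {
    // Safety: Complex64 is repr(C) with two f64 fields and no padding, so a
    // slice of n complex values is layout-compatible with 2n f64 values.
    unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut f64, buf.len() * 2) }
}
```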
The Arc wrapping of eigenvalues and FFT handler means sample_par clones only the Arc pointers — no recomputation of the O(n log n) setup.
For CUDA-enabled builds, sample_cuda(m) loads a pre-compiled shared library (libfgn.so/fgn.dll), uploads the eigenvalues once, and generates m paths on the GPU. The CUDA context is cached in a static Mutex<Option<CudaContext>> and reused across calls with matching parameters.
Macro-Generated Bindings
Three macros handle the binding boilerplate:
- `py_process_1d!` — for processes with `Output = Array1<T>`. Generates a struct with `inner_f32: Option<Process<f32>>` and `inner_f64: Option<Process<f64>>`, a `#[new]` constructor that dispatches based on the `dtype` string parameter, `sample()` that returns a numpy array, and `sample_par(m)` that collects paths into an `Array2` and returns a numpy matrix.
- `py_process_2x1d!` — for processes with `Output = [Array1<T>; 2]`. Same pattern, but `sample()` returns a tuple of two numpy arrays.
- `py_process_2d!` — for processes with `Output = Array2<T>`.
Processes with complex constructors (Heston, Bates, jump processes with custom distributions) have hand-written binding classes instead. Bates, for example, accepts a callable for the jump distribution — it wraps it in CallableDist (which implements rand_distr::Distribution<f64> by calling back into the interpreter via pyo3::Python::attach).
The Fn1D<T> and Fn2D<T> enums in traits.rs serve a similar purpose for function parameters — interest rate models like Hull-White accept either a native function pointer (fn(T) -> T) or a callable, dispatched at runtime.
92 classes total are registered in the stochastic_rs_python module.
Other Changes
- Visualization: `GridPlotter` uses plotly with `LayoutGrid` for multi-panel subplot rendering. `register()` takes any `ProcessExt<T>` whose `Output` implements `Plottable<T>` (implemented for `Array1<T>`, `[Array1<T>; 2]`, `Array2<T>`). Each registration generates `n_traj` trajectories and adds subplot traces with independent x/y axes.
- Loss functions (`quant::loss`): MAE, MSE, RMSE, MPE, MRE, MRPE, MAPE, MSPE, RMSPE — all operating on `&[f64]` pairs of market vs model values.
- Copulas: moved from `stats::copulas` to a top-level `copulas` module. Bivariate copulas (Clayton, Frank, Gumbel, Independence) implement `BivariateExt` with `fit()` via Kendall's tau → theta estimation and `sample()` via inverse CDF with Brent root-finding. Multivariate copulas (Gaussian, Tree, Vine) were cleaned up and simplified.
- Malliavin derivatives: integrated directly into the process structs instead of a separate module. `GBM::malliavin()` returns `[path, D_r S_t = sigma * S_T]`. `Heston::malliavin_of_vol()` returns `[price, vol, D_r v_t]` with the exponential decay term computed for both Sqrt and ThreeHalves power variants. `FBM::malliavin()` returns `D_s B^H_t = (1/Γ(H+1/2)) * (t-s)^{H-1/2}`.
- Allocator options: `features = ["jemalloc"]` sets `tikv_jemallocator::Jemalloc` as the global allocator, `features = ["mimalloc"]` sets `mimalloc::MiMalloc`. The release profile uses `codegen-units = 1` and `lto = true`.
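The loss functions all share the same shape — a fold over paired market/model slices. A sketch of two of them (illustrative signatures, not necessarily the library's exact API):

```rust
// Mean absolute error over paired market/model values.
fn mae(market: &[f64], model: &[f64]) -> f64 {
    market.iter().zip(model).map(|(a, b)| (a - b).abs()).sum::<f64>() / market.len() as f64
}

// Root mean squared error over the same pairs.
fn rmse(market: &[f64], model: &[f64]) -> f64 {
    (market.iter().zip(model).map(|(a, b)| (a - b) * (a - b)).sum::<f64>()
        / market.len() as f64)
        .sqrt()
}
```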
Made with love! 🦀🦀