stochastic-rs v1 stable

by Daniel Boros

Feb 14

8 min read

SIMD Random Number Generation

The library now ships its own SIMD-accelerated PRNG in simd_rng::SimdRng. It runs two independent engines:

  • Xoshiro256++ ×4 for f64 — operates on u64x4 (from the wide crate), produces 4 u64 values per state advance, converts to f64 by right-shifting 11 bits and scaling by 1.0 / (1u64 << 53)
  • Xoshiro128++ ×8 for f32 — operates on u32x8, produces 8 u32 values per state advance, converts to f32 by right-shifting 8 bits and scaling by 1.0 / (1u32 << 24)

Both are seeded from rand::rng() at construction. SimdRng implements RngCore, so it can be used as a drop-in replacement for any standard RNG, but the primary interface is next_f64_array() -> [f64; 8] and next_f32_array() -> [f32; 8] for batch consumption. There's also next_i32x8() -> i32x8, used directly by the Ziggurat samplers to avoid the float-conversion step entirely.
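A minimal usage sketch — the zero-argument constructor name is my assumption; the text above only says seeding happens via rand::rng():

use stochastic_rs::simd_rng::SimdRng; // module path as given above; crate root assumed

fn main() {
  let mut rng = SimdRng::new(); // constructor name assumed

  // Batch interfaces: one call yields a full batch of uniforms in [0, 1).
  let xs: [f64; 8] = rng.next_f64_array();
  let ys: [f32; 8] = rng.next_f32_array();

  // Raw integer lanes, as consumed by the Ziggurat samplers.
  let hz = rng.next_i32x8();
  println!("{xs:?} {ys:?} {:?}", hz.to_array());
}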

SIMD Ziggurat Algorithm (Normal, Exp)

SimdNormal<T: SimdFloatExt, const N: usize> implements the Ziggurat method with a SIMD fast path. The algorithm per iteration:

  1. Draw 8 random i32 lanes (hz) via SimdRng::next_i32x8()
  2. Compute iz = hz & 127 (table index, 128 rectangles)
  3. Gather 8 kn[iz] threshold values into an i32x8
  4. SIMD compare: abs(hz) < kn[iz] → produces an 8-lane mask
  5. If all 8 accepted (accept.all()): compute result = hz_float * wn[iz] in SIMD, apply mean/stddev scaling via mean_simd + std_dev_simd * result, write 8 values to output buffer at once
  6. If any rejected: fall through to scalar loop — accepted lanes get the SIMD result, rejected lanes call nfix() which handles the tail (Marsaglia's tail algorithm for iz == 0, rejection sampling for rectangle edges)
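A scalar rendering of steps 2–5, with the per-lane logic spelled out (kn/wn are placeholder tables here; the real code does the index, compare, and multiply on whole SIMD registers):

fn ziggurat_batch(hz: [i32; 8], kn: &[i32; 128], wn: &[f64; 128], out: &mut Vec<f64>) {
  let iz = hz.map(|h| (h & 127) as usize); // step 2: table index per lane
  // steps 3-4: gather thresholds and compare; the SIMD path gets this as an 8-lane mask
  let accept_all = hz.iter().zip(&iz).all(|(&h, &i)| (h as i64).abs() < (kn[i] as i64));
  if accept_all {
    // step 5: result = hz_float * wn[iz], written out as a whole batch
    for (&h, &i) in hz.iter().zip(&iz) {
      out.push(h as f64 * wn[i]);
    }
  }
  // step 6 (not shown): rejected lanes fall through to the scalar nfix() tail handler
}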

The Ziggurat tables (kn, wn, fn_tab — 128 entries each) are computed once via OnceLock and shared across all instances.

There are two codepaths: fill_ziggurat() applies mean+stddev scaling, fill_ziggurat_standard() skips it (outputs N(0,1) directly). The standard variant is what FGN uses internally.

The const generic N controls the internal buffer size. When calling distribution.sample(&mut rng) (single value), it actually refills the entire N-element buffer via the SIMD path and returns values from it one by one. fill_slice() and fill_standard_fast() bypass the buffer entirely and write straight into the caller's slice.
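The refill-and-drain pattern in the abstract — field and method names here are illustrative, not the library's:

struct Buffered<const N: usize> {
  buf: [f64; N],
  pos: usize,
}

impl<const N: usize> Buffered<N> {
  fn next_value(&mut self, refill: impl Fn(&mut [f64])) -> f64 {
    if self.pos == N {
      refill(&mut self.buf); // one SIMD pass refills all N slots at once
      self.pos = 0;
    }
    let v = self.buf[self.pos];
    self.pos += 1;
    v
  }
}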

SimdExpZig follows the same pattern but with 256 rectangles (iz = hz & 0xFF) and exponential tail handling (efix()).

Other Distributions

The remaining distributions build on top of SimdNormal, SimdExpZig, and SimdRng through different strategies:

Direct SIMD transform — distributions that are a closed-form function of uniform samples. These process 8 values per iteration through SIMD math:

  • SimdCauchy: generates 8 uniforms via fill_uniform_simd, computes x0 + gamma * tan(π * (u - 0.5)) in SIMD using simd_tan (see the sketch after this list)
  • SimdPareto: scale * u^(-1/alpha) via simd_powf
  • SimdLogNormal: draws from SimdNormal, applies simd_exp
  • SimdWeibull, SimdUniform: similar transform patterns
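Taking SimdCauchy as the example, the transform in scalar form (the real code applies simd_tan across 8 lanes at once):

fn cauchy_from_uniform(u: [f64; 8], x0: f64, gamma: f64) -> [f64; 8] {
  // inverse-CDF transform: x0 + gamma * tan(pi * (u - 0.5))
  u.map(|ui| x0 + gamma * (std::f64::consts::PI * (ui - 0.5)).tan())
}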

Rejection sampling with SIMD base — distributions that need accept/reject loops but use SIMD-accelerated normals/uniforms internally:

  • SimdGamma: Marsaglia & Tsang's method — draws normal samples via the internal SimdNormal, uniform checks via SimdRng. For alpha < 1, uses the alpha + 1 trick with a power-of-uniform correction (a scalar sketch follows this list)
  • SimdBeta: generates two independent Gamma samples, computes x / (x + y)
  • SimdChiSquared: delegates to SimdGamma(k/2, 2.0)
  • SimdStudentT: normal / sqrt(chi_squared / k)
  • SimdPoisson: Knuth's algorithm with SimdExpZig-accelerated exponential draws
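For reference, a scalar sketch of Marsaglia & Tsang — the library draws its normal and uniform inputs from SimdNormal and SimdRng rather than rand_distr as here:

use rand::Rng;
use rand_distr::{Distribution, StandardNormal};

fn gamma_mt<R: Rng>(rng: &mut R, alpha: f64) -> f64 {
  if alpha < 1.0 {
    // the alpha + 1 trick with the u^(1/alpha) correction
    let u: f64 = rng.random();
    return gamma_mt(rng, alpha + 1.0) * u.powf(1.0 / alpha);
  }
  let d = alpha - 1.0 / 3.0;
  let c = 1.0 / (9.0 * d).sqrt();
  loop {
    let x: f64 = StandardNormal.sample(rng);
    let v = (1.0 + c * x).powi(3);
    if v <= 0.0 {
      continue;
    }
    let u: f64 = rng.random();
    // squeeze test (the 0.0331 constant also mentioned in the FloatExt section), then the exact test
    if u < 1.0 - 0.0331 * x.powi(4) || u.ln() < 0.5 * x * x + d * (1.0 - v + v.ln()) {
      return d * v;
    }
  }
}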

All distributions use a 16-element internal buffer (UnsafeCell<[T; 16]>) and refill via their respective fill_slice_fast() when exhausted. They implement rand_distr::Distribution<T>, so they're compatible with the standard dist.sample(&mut rng) API.

Benchmarks vs rand_distr

Measured with Criterion on Apple M-series, --release, 100K samples:

  • Normal f64 N=32: 201µs vs 973µs → 4.84x
  • Exp f64 N=32: 188µs vs 924µs → 4.91x
  • Beta f64: 1129µs vs 4652µs → 4.12x
  • Cauchy f32: 231µs vs 1010µs → 4.37x
  • StudentT f64: 784µs vs 2332µs → 2.97x
  • Gamma f64: 560µs vs 1490µs → 2.66x
  • LogNormal f64: 468µs vs 1284µs → 2.74x
  • ChiSquared f64: 545µs vs 1478µs → 2.71x
  • Poisson u32: 2166µs vs 4235µs → 1.96x
  • Weibull f64: 1025µs vs 1510µs → 1.47x
  • Uniform: ~1:1 (already bottlenecked by RNG throughput)

The biggest wins are on Normal/Exp (Ziggurat fast path) and distributions that compose them (Beta = two Gammas, each calling Normal internally). Distributions that are dominated by transcendental functions (Weibull's powf) see smaller gains since the SIMD math functions themselves are the bottleneck.

Generic Precision — <T: FloatExt>

The f32 and f64 feature flags are removed. All models are generic over T: FloatExt. The trait hierarchy:

pub trait SimdFloatExt: num_traits::Float + Default + Send + Sync + 'static {
  type Simd: Copy + Mul<Output = Self::Simd> + Add<Output = Self::Simd> + ...;

  fn splat(val: Self) -> Self::Simd;
  fn simd_from_array(arr: [Self; 8]) -> Self::Simd;
  fn simd_to_array(v: Self::Simd) -> [Self; 8];
  fn simd_ln(v: Self::Simd) -> Self::Simd;
  fn simd_sqrt(v: Self::Simd) -> Self::Simd;
  fn simd_exp(v: Self::Simd) -> Self::Simd;
  fn fill_uniform_simd(rng: &mut SimdRng, out: &mut [Self]);
  fn simd_from_i32x8(v: i32x8) -> Self::Simd;
  fn from_f64_fast(v: f64) -> Self;
  // ...
}

SimdFloatExt is implemented for f32 (where Simd = f32x8) and f64 (where Simd = f64x8). This is what allows a single generic fill_ziggurat<T> to emit SIMD code for both precisions — the trait dispatches to the right wide type at compile time.

FloatExt extends SimdFloatExt with num_traits::FromPrimitive, FloatConst, ScalarOperand, AddAssign, SubAssign, and provides from_usize_() and normal_array(). Every stochastic process is bounded by T: FloatExt.

The from_f64_fast(v: f64) -> Self method is used throughout to convert compile-time constants (like 0.5, 0.0331 in the Gamma sampler) without going through FromPrimitive's Option — for f64 it's identity, for f32 it's a cast.
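A standalone analog of that behavior (not the library's actual impls):

trait FromF64Fast {
  fn from_f64_fast(v: f64) -> Self;
}

impl FromF64Fast for f64 {
  fn from_f64_fast(v: f64) -> f64 { v } // identity, no Option round-trip
}

impl FromF64Fast for f32 {
  fn from_f64_fast(v: f64) -> f32 { v as f32 } // single cast
}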

Trait System

pub trait ProcessExt<T: FloatExt>: Send + Sync {
  type Output: Send;
  fn sample(&self) -> Self::Output;

  fn sample_par(&self, m: usize) -> Vec<Self::Output> {
    (0..m).into_par_iter().map(|_| self.sample()).collect()
  }

  #[cfg(feature = "cuda")]
  fn sample_cuda(&self, _m: usize) -> Result<Either<Array1<f64>, Array2<f64>>> { ... }
}

ProcessExt requires Send + Sync so sample_par can call sample() from rayon worker threads. The Output type is Array1<T> for single-component processes (GBM, OU, FGN, etc.) and [Array1<T>; 2] for two-component processes (Heston returns [price, variance], SABR returns [forward, vol], correlated Gaussian noise returns [gn1, gn2]).
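A toy implementation, assuming the trait is importable from the crate root (the path is my guess) — sample_par then comes for free:

use ndarray::Array1;
use stochastic_rs::ProcessExt; // import path assumed

// Deterministic single-component "process": pure drift, no noise.
struct ConstantDrift { n: usize, mu: f64 }

impl ProcessExt<f64> for ConstantDrift {
  type Output = Array1<f64>;
  fn sample(&self) -> Array1<f64> {
    Array1::from_iter((0..self.n).map(|i| self.mu * i as f64))
  }
}

// let paths = ConstantDrift { n: 252, mu: 0.1 }.sample_par(1000); // fans out over rayon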

DistributionExt provides analytical distribution properties for processes that have them. GBM implements it by wrapping statrs::LogNormal — at construction time it computes the log-normal parameters (mu_ln = ln(x0) + (mu - 0.5*sigma^2)*t, sigma_ln = sigma*sqrt(t)) and delegates pdf, cdf, inv_cdf, mean, variance, skewness, entropy to the statrs implementation.
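That parameter mapping in isolation, using statrs directly (LogNormal::new returns a Result):

use statrs::distribution::{ContinuousCDF, LogNormal};

fn gbm_terminal_dist(x0: f64, mu: f64, sigma: f64, t: f64) -> LogNormal {
  let mu_ln = x0.ln() + (mu - 0.5 * sigma * sigma) * t;
  let sigma_ln = sigma * t.sqrt();
  LogNormal::new(mu_ln, sigma_ln).expect("valid GBM parameters")
}

// e.g. P(S_T <= 100): gbm_terminal_dist(100.0, 0.05, 0.2, 1.0).cdf(100.0)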

FGN Implementation Details

The Fractional Gaussian Noise generator uses the Davies-Harte / Circulant Embedding method via FFT. Construction (FGN::new):

  1. Build the autocovariance vector r[k] = 0.5 * ((k+1)^(2H) - 2*k^(2H) + |k-1|^(2H)) for k = 0..n — the absolute value matters at k = 0 (sketched below)
  2. Mirror it to form a circulant: [r[0], ..., r[n], r[n-1], ..., r[1]] (length 2n)
  3. FFT the circulant to get eigenvalues
  4. Take sqrt(eigenvalues / (2n)) — stored as Arc<Array1<Complex<T>>> for thread-safe sharing
  5. Pre-allocate an Arc<FftHandler<T>> for the 2n-point FFT used during sampling

n is rounded up to the next power of two (required by the circulant embedding), with an offset tracking the padding.
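Step 1 in isolation (my sketch, using the |k - 1| form):

fn fgn_autocov(hurst: f64, n: usize) -> Vec<f64> {
  let two_h = 2.0 * hurst;
  (0..=n)
    .map(|k| {
      let k = k as f64;
      0.5 * ((k + 1.0).powf(two_h) - 2.0 * k.powf(two_h) + (k - 1.0).abs().powf(two_h))
    })
    .collect()
}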

Sampling (FGN::sample):

  1. Allocate Array1<Complex<T>> of length 2n
  2. Cast the complex array to &mut [T] via from_raw_parts_mut — this works because Complex<T> is repr(C) with layout {re: T, im: T}, giving a flat slice of 4n real values (sketched after this list)
  3. Fill the entire flat slice with SimdNormal::fill_standard_fast — one bulk SIMD Ziggurat pass, no per-element or per-pair overhead
  4. Element-wise multiply by the pre-computed sqrt eigenvalues
  5. FFT the result
  6. Extract the real parts from indices [1..n-offset+1] and scale by n^(-H) * t^H
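Step 2's cast in isolation — sound because num_complex::Complex<T> is #[repr(C)] with re followed by im:

use num_complex::Complex;

fn as_flat_reals(buf: &mut [Complex<f64>]) -> &mut [f64] {
  // a length-2n complex slice exposes 4n contiguous f64 values: re, im, re, im, ...
  unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut f64, buf.len() * 2) }
}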

The Arc wrapping of eigenvalues and FFT handler means sample_par clones only the Arc pointers — no recomputation of the O(n log n) setup.

For CUDA-enabled builds, sample_cuda(m) loads a pre-compiled shared library (libfgn.so/fgn.dll), uploads the eigenvalues once, and generates m paths on the GPU. The CUDA context is cached in a static Mutex<Option<CudaContext>> and reused across calls with matching parameters.

Macro-Generated Bindings

Three macros handle the binding boilerplate:

  • py_process_1d! — for processes with Output = Array1<T>. Generates a struct with inner_f32: Option<Process<f32>> and inner_f64: Option<Process<f64>>, a #[new] constructor that dispatches based on the dtype string parameter, sample() that returns a numpy array, and sample_par(m) that collects paths into an Array2 and returns a numpy matrix. (A skeleton of this dispatch follows the list.)

  • py_process_2x1d! — for processes with Output = [Array1<T>; 2]. Same pattern but sample() returns a tuple of two numpy arrays.

  • py_process_2d! — for processes with Output = Array2<T>.
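A pyo3-free skeleton of the dispatch py_process_1d! is described to generate — every name below is illustrative, not the macro's actual output:

// Stand-in for a library process, generic over precision.
struct Process<T> { x0: T }
impl<T: Copy> Process<T> { fn sample(&self) -> Vec<T> { vec![self.x0] } }

// The generated class holds one optional inner per precision and picks at construction.
struct PyProcess {
  inner_f32: Option<Process<f32>>,
  inner_f64: Option<Process<f64>>,
}

impl PyProcess {
  fn new(dtype: &str, x0: f64) -> Self {
    match dtype {
      "float32" => Self { inner_f32: Some(Process { x0: x0 as f32 }), inner_f64: None },
      _ => Self { inner_f32: None, inner_f64: Some(Process { x0 }) },
    }
  }
}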

Processes with complex constructors (Heston, Bates, jump processes with custom distributions) have hand-written binding classes instead. Bates, for example, accepts a callable for the jump distribution — it wraps it in CallableDist (which implements rand_distr::Distribution<f64> by calling back into the interpreter via pyo3::Python::attach).

The Fn1D<T> and Fn2D<T> enums in traits.rs serve a similar purpose for function parameters — interest rate models like Hull-White accept either a native function pointer (fn(T) -> T) or a callable, dispatched at runtime.
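An enum of the described shape — the Python-callable arm is stood in by a boxed closure here, since the real variant holds a pyo3 object:

enum Fn1D<T> {
  Native(fn(T) -> T),
  Callable(Box<dyn Fn(T) -> T + Send + Sync>),
}

impl<T> Fn1D<T> {
  fn call(&self, x: T) -> T {
    match self {
      Fn1D::Native(f) => f(x),
      Fn1D::Callable(f) => f(x),
    }
  }
}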

92 classes total are registered in the stochastic_rs_python module.

Other Changes

  • Visualization: GridPlotter uses plotly with LayoutGrid for multi-panel subplot rendering. register() takes any ProcessExt<T> whose Output implements Plottable<T> (implemented for Array1<T>, [Array1<T>; 2], Array2<T>). Each registration generates n_traj trajectories and adds subplot traces with independent x/y axes.

  • Loss functions (quant::loss): MAE, MSE, RMSE, MPE, MRE, MRPE, MAPE, MSPE, RMSPE — all operating on &[f64] pairs of market vs. model values (two are sketched after this list).

  • Copulas: moved from stats::copulas to top-level copulas module. Bivariate copulas (Clayton, Frank, Gumbel, Independence) implement BivariateExt with fit() via Kendall's tau → theta estimation, sample() via inverse CDF with Brent root-finding. Multivariate copulas (Gaussian, Tree, Vine) were cleaned up and simplified.

  • Malliavin derivatives: integrated directly into the process structs instead of a separate module. GBM::malliavin() returns [path, D_r S_t = sigma * S_t]. Heston::malliavin_of_vol() returns [price, vol, D_r v_t] with the exponential decay term computed for both Sqrt and ThreeHalves power variants. FBM::malliavin() returns D_s B^H_t = (1/Γ(H+1/2)) * (t-s)^{H-1/2}.

  • Allocator options: features = ["jemalloc"] sets tikv_jemallocator::Jemalloc as the global allocator, features = ["mimalloc"] sets mimalloc::MiMalloc. Release profile uses codegen-units = 1 and lto = true.
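Two of those loss metrics sketched under the stated &[f64] convention (my rendering, not the library code):

fn mae(market: &[f64], model: &[f64]) -> f64 {
  market.iter().zip(model).map(|(a, b)| (a - b).abs()).sum::<f64>() / market.len() as f64
}

fn rmse(market: &[f64], model: &[f64]) -> f64 {
  let mse = market.iter().zip(model).map(|(a, b)| (a - b).powi(2)).sum::<f64>() / market.len() as f64;
  mse.sqrt()
}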

Made with love! 🦀🦀