Speed up my financial mathetmatics crates in Rust

I am studying financial mathematics in an applied mathematics PhD program, where my research area focuses on modeling financial processes using stochastic processes. I estimate and examine the parameters of various stochastic processes with DNNs. To properly train neural networks with such data, tens to hundreds of millions of generated synthetic data points are needed (from now on, let's call them trajectories, as realizations from $\Omega$ ). Since I'm a Rust enthusiast, despite the Python hype, I write all my research programs in Rust, and in my everyday work, I try to work in Rust as much as possible. This is how the stochastic-rs library was born, which originally started as a university research project, but I decided to expand it later with the necessary statistics for stochastic processes, potentially pre-built neural networks, financial pricing formulas, and possibly even the implementation of the Malliavin operator (though this is very far in the future).

The main direction of my research focuses on the study of fractional processes, which, if we are somewhat familiar with the topic, we can simply say have a covariance structure and are not Markov processes, meaning their increments are not independent. They are a generalization of the ordinary Brownian motion and belong to the family of isonormal processes.

For those not familiar with the topic, a simple way to imagine this is by taking a time series and saying that this process 'remembers' its past. How long it remembers is determined by the parameter of the process.

Let's see the exact definition of the previously mentioned processes:

1. Brownian Motion (Wiener Process):

A Brownian motion, denoted by $B(t)$ , is a continuous-time stochastic process that satisfies the following properties:

$B(0) = 0$ (starts at zero).
For any $0 \leq t_1 < t_2 < \dots < t_n$ , the increments $B(t_{i+1}) - B(t_i)$ are independent for all $i$ .
The increments $B(t_{i+1}) - B(t_i)$ are normally distributed with mean 0 and variance $t_{i+1} - t_i$ (i.e., $B(t_{i+1}) - B(t_i) \sim \mathcal{N}(0, t_{i+1} - t_i)$ ).
$B(t)$ has continuous paths almost surely (i.e., with probability 1).

2. Fractional Brownian Motion (fBm):

A fractional Brownian motion, denoted by $B_H(t)$ , is a generalization of Brownian motion that depends on a parameter $H \in (0, 1)$ , called the Hurst parameter. It has the following properties:

$B_H(0) = 0$ (starts at zero).
It has stationary increments, meaning the distribution of $B_H(t+s) - B_H(s)$ depends only on $t$ .
The covariance of $B_H(t)$ and $B_H(s)$ is given by: $\mathbb{E}[B_H(t) B_H(s)] = \frac{1}{2} \left( t^{2H} + s^{2H} - |t - s|^{2H} \right)$
Unlike Brownian motion, for $H \neq 1/2$ , the increments of fractional Brownian motion are not independent, and for $H > 1/2$ , the process exhibits long-range dependence or "memory."

For $H = 1/2$ , fractional Brownian motion reduces to standard Brownian motion.

3. Isonormal Process:

An isonormal Gaussian process, often used in stochastic analysis, is a centered Gaussian family of random variables $\{ X(h) : h \in H \}$ , indexed by elements of a real Hilbert space $H$ , such that:

$\mathbb{E}[X(h)] = 0$ for all $h \in H$ .
The covariance structure is given by: $\mathbb{E}[X(h_1) X(h_2)] = \langle h_1, h_2 \rangle_H$ where $\langle \cdot, \cdot \rangle_H$ is the inner product in the Hilbert space $H$ .

Let's do some programming

My goal was to create the fastest fractional Brownian motion generator available in the open-source world. In the literature, various methods for generating fBm are known. Among them are exact and approximate methods. The two most well-known are generation using Cholesky decomposition and the Davies-Harte method. Since the Cholesky method has a computational complexity of $O(n^3)$ , it was clearly ruled out. Other methods exist, such as the Hosking method, which uses Fast Fourier or Z-transformation. In recent research, fractional noise has even been generated using quantum physical methods. However, I chose the Davies-Harte Fast Fourier algorithm, which is a widely used method for generating fast and accurate trajectories.

Without going into the details of the algorithm, the essence of the method is that we can compute the square root of the covariance matrix by embedding it into a circulant matrix, and its Fourier transform becomes the required matrix. We then multiply this by a random matrix, resulting in two independent fractional Gaussian noises, the cumulative of which gives the fractional Gaussian noise. For other methods and a more detailed description, you can refer to this paper: link

Although not from the very first attempt, let's see how our code progressed towards maximum performance. First, the logic was implemented. As I mentioned earlier, this includes the generation of the covariance matrix, fast Fourier transformation, and finally multiplying it by the random matrix.

use std::sync::{Arc, Mutex};

use ndarray::parallel::prelude::*;
use ndarray::{concatenate, prelude::*};
use ndarray_rand::rand_distr::StandardNormal;
use ndarray_rand::RandomExt;
use ndrustfft::{ndfft, FftHandler};
use num_complex::{Complex, ComplexDistribution};

pub fn fgn_generate(hurst: f64, n: usize, t: Option<f64>) -> Array1<f64> {
    if !(0.0..=1.0).contains(&hurst) {
        panic!("Hurst parameter must be between 0 and 1");
    }

    let mut r = Array1::linspace(0.0, n as f64, n + 1);
    r.mapv_inplace(|x| {
        if x == 0.0 {
            1.0
        } else {
            0.5 * ((x + 1.0).powf(2.0 * hurst)
                - 2.0 * x.powf(2.0 * hurst)
                + (x - 1.0).powf(2.0 * hurst))
        }
    });

    let r = concatenate(
        Axis(0),
        &[r.view(), r.slice(s![..;-1]).slice(s![1..-1]).view()],
    )
    .unwrap();

    let data = r.mapv(|v| Complex::new(v, 0.0));
    let r_fft = FftHandler::new(r.len());
    let mut sqrt_eigenvalues = Array1::<Complex<f64>>::zeros(r.len());
    ndfft(&data, &mut sqrt_eigenvalues, &r_fft, 0);
    sqrt_eigenvalues.mapv_inplace(|x| Complex::new((x.re / (2.0 * n as f64)).sqrt(), x.im));

    let sqrt_eigenvalues = Arc::new(sqrt_eigenvalues);
    let fft_handler = Arc::new(FftHandler::new(2 * n));
    let rnd = Array1::<Complex<f64>>::random(2 * n, ComplexDistribution::new(StandardNormal, StandardNormal));

    let fgn = &*sqrt_eigenvalues * &rnd;
    let mut fgn_fft = Array1::<Complex<f64>>::zeros(2 * n);
    ndfft(&fgn, &mut fgn_fft, &*fft_handler, 0);
    let scale = (n as f64).powf(-hurst) * t.unwrap_or(1.0).powf(hurst);
    let fgn_result = fgn_fft
        .slice(s![1..n])
        .mapv(|x: Complex<f64>| x.re * scale);
    
    fgn_result
}

Now, let's look at the fundamental issues. In time series analysis, we often need to generate trajectories thousands of times, not just once, and in this case, a function-based solution is not very advantageous. As long as the Hurst parameter of the fractional Brownian motion doesn't change, the algorithm remains the same up to the generation of the complex random matrix. However, if we generate everything within the function, the covariance matrix must be regenerated for each trajectory, which will be $O(n^2)$ at best. So, let's make the following modifications.

In the next step, we take advantage of Rust's support for structs, which allows us to store an internal state for the covariance matrix during generation. This way, we only need to call the sample function, which generates the process from the remaining code, making the generation much more efficient.

pub struct Fgn {
  pub hurst: f64,
  pub n: usize,
  pub t: Option<f64>,
  pub m: Option<usize>,
  pub offset: usize,
  pub sqrt_eigenvalues: Arc<Array1<Complex<f64>>>,
  pub fft_handler: Arc<FftHandler<f64>>,
}

impl Default for Fgn {
  fn default() -> Self {
    Self::new(0.7, 1024, None, None)
  }
}

impl Fgn {
  #[must_use]
  pub fn new(hurst: f64, n: usize, t: Option<f64>, m: Option<usize>) -> Self {
    if !(0.0..=1.0).contains(&hurst) {
      panic!("Hurst parameter must be between 0 and 1");
    }

    let mut r = Array1::linspace(0.0, n as f64, n + 1);
    r.mapv_inplace(|x| {
      if x == 0.0 {
        1.0
      } else {
        0.5
          * ((x + 1.0).powf(2.0 * hurst) - 2.0 * x.powf(2.0 * hurst) + (x - 1.0).powf(2.0 * hurst))
      }
    });
    let r = concatenate(
      Axis(0),
      #[allow(clippy::reversed_empty_ranges)]
      &[r.view(), r.slice(s![..;-1]).slice(s![1..-1]).view()],
    )
    .unwrap();
    let data = r.mapv(|v| Complex::new(v, 0.0));
    let r_fft = FftHandler::new(r.len());
    let mut sqrt_eigenvalues = Array1::<Complex<f64>>::zeros(r.len());
    ndfft(&data, &mut sqrt_eigenvalues, &r_fft, 0);
    sqrt_eigenvalues.mapv_inplace(|x| Complex::new((x.re / (2.0 * n as f64)).sqrt(), x.im));

    Self {
      hurst,
      n,
      offset,
      t,
      sqrt_eigenvalues: Arc::new(sqrt_eigenvalues),
      m,
      fft_handler: Arc::new(FftHandler::new(2 * n)),
    }
  }
}

impl Sampling<f64> for Fgn {
  fn sample(&self) -> Array1<f64> {
    let rnd = Array1::<Complex<f64>>::random(2 * n, ComplexDistribution::new(StandardNormal, StandardNormal));

    let fgn = &*self.sqrt_eigenvalues * &rnd;
    let mut fgn_fft = Array1::<Complex<f64>>::zeros(2 * self.n);
    ndfft(&fgn, &mut fgn_fft, &*self.fft_handler, 0);
    let scale = (self.n as f64).powf(-self.hurst) * self.t.unwrap_or(1.0).powf(self.hurst);
    let fgn = fgn_fft
      .slice(s![1..self.n - self.offset + 1])
      .mapv(|x: Complex<f64>| x.re * scale);
    fgn
  }
}

So we created an Fgn struct, which will store the relevant variables, and during the program's execution, we only need to call the sample function, which will generate the trajectory for us.

Let’s move forward and look at the steps we can take to speed up the generation process.

Take advantage of the Fast Fourier algorithm and ensure that $n$ is always a power of 2 during the generation.
Focus on branchless programming, so the code never has to backtrack and can simply proceed from top to bottom.
In the code, we have already optimized most of the memory copying, using references instead of .clone() wherever possible, except where external library limitations prevent this.
By measuring the execution times step by step, we find that generating the complex random matrix takes a significant amount of time, and we can parallelize this by breaking it into chunks.
Finally, we can also generate multiple trajectories in parallel.

Let’s apply these modifications. The final version of out implementation is the following:

use std::sync::{Arc, Mutex};

use ndarray::parallel::prelude::*;
use ndarray::{concatenate, prelude::*};
use ndarray_rand::rand_distr::StandardNormal;
use ndarray_rand::RandomExt;
use ndrustfft::{ndfft, FftHandler};
use num_complex::{Complex, ComplexDistribution};

pub trait Sampling<T: Clone + Send + Sync + Zero>: Send + Sync {
  fn sample(&self) -> Array1<T>;
  fn sample_par(&self) -> Array2<T> {
    if self.m().is_none() {
      panic!("m must be specified for parallel sampling");
    }

    let mut xs = Array2::zeros((self.m().unwrap(), self.n()));

    xs.axis_iter_mut(Axis(0)).into_par_iter().for_each(|mut x| {
      x.assign(&self.sample());
    });

    xs
  }
  fn n(&self) -> usize;
  fn m(&self) -> Option<usize>;
}

pub struct Fgn {
  pub hurst: f64,
  pub n: usize,
  pub t: Option<f64>,
  pub m: Option<usize>,
  pub offset: usize,
  pub sqrt_eigenvalues: Arc<Array1<Complex<f64>>>,
  pub fft_handler: Arc<FftHandler<f64>>,
}

impl Default for Fgn {
  fn default() -> Self {
    Self::new(0.7, 1024, None, None)
  }
}

impl Fgn {
  #[must_use]
  pub fn new(hurst: f64, n: usize, t: Option<f64>, m: Option<usize>) -> Self {
    if !(0.0..=1.0).contains(&hurst) {
      panic!("Hurst parameter must be between 0 and 1");
    }

    let offset = n.next_power_of_two() - n;
    let n = n.next_power_of_two();
    let mut r = Array1::linspace(0.0, n as f64, n + 1);
    r.mapv_inplace(|x| {
      if x == 0.0 {
        1.0
      } else {
        0.5
          * ((x + 1.0).powf(2.0 * hurst) - 2.0 * x.powf(2.0 * hurst) + (x - 1.0).powf(2.0 * hurst))
      }
    });
    let r = concatenate(
      Axis(0),
      #[allow(clippy::reversed_empty_ranges)]
      &[r.view(), r.slice(s![..;-1]).slice(s![1..-1]).view()],
    )
    .unwrap();
    let data = r.mapv(|v| Complex::new(v, 0.0));
    let r_fft = FftHandler::new(r.len());
    let mut sqrt_eigenvalues = Array1::<Complex<f64>>::zeros(r.len());
    ndfft(&data, &mut sqrt_eigenvalues, &r_fft, 0);
    sqrt_eigenvalues.mapv_inplace(|x| Complex::new((x.re / (2.0 * n as f64)).sqrt(), x.im));

    Self {
      hurst,
      n,
      offset,
      t,
      sqrt_eigenvalues: Arc::new(sqrt_eigenvalues),
      m,
      fft_handler: Arc::new(FftHandler::new(2 * n)),
    }
  }
}

impl Sampling<f64> for Fgn {
  fn sample(&self) -> Array1<f64> {
    let num_threads = rayon::current_num_threads();
    let chunk_size = (2 * self.n) / num_threads;
    let rnd = Arc::new(Mutex::new(Array1::<Complex<f64>>::zeros(2 * self.n)));

    (0..num_threads).into_par_iter().for_each(|i| {
      let chunk = Array1::<Complex<f64>>::random(
        chunk_size,
        ComplexDistribution::new(StandardNormal, StandardNormal),
      );

      let mut result_lock = rnd.lock().unwrap();
      result_lock
        .slice_mut(s![i * chunk_size..(i + 1) * chunk_size])
        .assign(&chunk);
    });

    let fgn = &*self.sqrt_eigenvalues * &*rnd.lock().unwrap();
    let mut fgn_fft = Array1::<Complex<f64>>::zeros(2 * self.n);
    ndfft(&fgn, &mut fgn_fft, &*self.fft_handler, 0);
    let scale = (self.n as f64).powf(-self.hurst) * self.t.unwrap_or(1.0).powf(self.hurst);
    let fgn = fgn_fft
      .slice(s![1..self.n - self.offset + 1])
      .mapv(|x: Complex<f64>| x.re * scale);
    fgn
  }

  fn n(&self) -> usize {
    self.n - self.offset
  }

  fn m(&self) -> Option<usize> {
    self.m
  }
}

Results

Currently, this is where we are, but I wanted to go beyond this. Let’s see what else we can do without modifying the crates we are using. Considering that we are only using purely Rust-implemented crates, all of them support OS-level optimizations, such as AVX, NEON, Parallel, and SSE, for example, for the Fast Fourier algorithm.

Implementation	Time (seconds)
fn impl 10000 sequences, 10000 long	14.61
final single gen	4.60
final single gen without multithreaded complex distr	5.30
final multi thread	1.12
final multi thread gen without multithreaded complex distr	1.13

After extensive research, I concluded that it’s worth looking into Rust compiler settings if we want to achieve further optimization and speed up the code. The Cargo.toml offers numerous options to fine-tune the compiler. In the stochastic-rs crate, what led to another performance jump was redefining the default memory allocator by using tikv-jemalloc and setting code-units=1, which ensures that the compiler does not split the binary.

After these changes, the following results were achieved:

Implementation	Time (seconds)
fn impl 10000 sequences, 10000 long	13.6
final single gen	3.84
final multi thread	0.71

Conclusion:

Optimizing the fractional Brownian motion generator has been a rewarding process, reaffirming the power and flexibility of Rust in high-performance computing tasks. Starting with a 14-second runtime, I knew there was room for significant improvement. Through algorithmic adjustments and leveraging Rust's system-level capabilities, I was able to make substantial gains.

The biggest breakthroughs came from careful parallelization, avoiding redundant computations, and tapping into Rust’s compiler optimizations. Implementing tikv-jemalloc for memory allocation and setting code-units=1 were crucial steps that allowed the code to perform at its best, reducing the runtime to 0.71 seconds.

These optimizations are not just about one-off performance gains; they reflect the potential Rust offers for continuous refinement. It’s a language that rewards precision, giving developers control over every part of the process. The improvements here demonstrate that, with the right approach, even the most complex algorithms can be pushed to their limits, resulting in faster and more efficient execution. There's still more room to explore, but this marks a significant step forward in maximizing the potential of both the code and the tools Rust provides.