CUDA Programming in Rust: Extending Fractional Noise Generation for AI Models

by Daniel Boros

Mar 8

7 min read

76 views

In this post, I will share my journey of implementing CUDA support in Rust for generating large quantities of synthetic data for AI model training, particularly focusing on fractional stochastic processes. This work is part of the stochastic-rs crate, a Rust library that provides a wide range of stochastic processes—currently nearly 60—for simulation, statistical modeling, and quantitative finance applications. The library supports both classical and fractional stochastic differential equations (SDEs), allowing for more advanced modeling by replacing the usual Wiener process with a fractional Wiener process.

Below, I’ll go through the motivation behind adding CUDA to accelerate the generation of fractional Gaussian noise (fGn), some key steps of the implementation, and how we integrate the resulting GPU-accelerated code into Rust.

Why Fractional Processes?

A fractional stochastic process emerges when we replace the standard Wiener process in an SDE with a fractional Wiener process (fBm). The resulting process can exhibit long-range dependence and other behaviors not captured by standard Brownian motion. In finance, for example, these models can be useful when the usual assumption of uncorrelated increments does not hold, and in physics or other sciences, fractional processes can describe more complex phenomena.

The stochastic-rs crate provides around 60 different processes, many of which also have fractional variations. The core enabler for fractional processes here is generating high-quality fractional Gaussian noise (fGn) efficiently. On the CPU side, the Davies–Harte algorithm (a Fast Fourier Transform-based approach) is used. While this works well for moderate problem sizes, GPU acceleration can help a lot with larger-scale simulations, especially when repeatedly performing FFTs and large matrix multiplications.

Why CUDA in Rust?

While Rust is an excellent language for systems programming, it still lags behind C++ in terms of official, comprehensive CUDA support. Although there is a Rust-CUDA project under development, it does not yet cover everything needed—particularly cuFFT, cuComplex, and curand_kernel for random number generation in complex space. Therefore, the approach chosen was to implement the essential CUDA kernels in .cu files and invoke them from Rust via FFI (Foreign Function Interface).

The primary goal was to keep the entire library in Rust, yet still leverage the GPU for the most computation-intensive parts of fractional Gaussian noise generation. Below is a simplified overview of how the Davies–Harte method was offloaded to the GPU.

Davies–Harte on the GPU: Key Steps

Recall: The Davies–Harte algorithm uses an autocovariance vector to build an embedded circular matrix, then performs an FFT to obtain the eigenvalues. From there, we sample complex Gaussian random variables, multiply them by those eigenvalues, transform back with an inverse FFT, and apply a scaling factor. This yields the fractional Gaussian noise time series.

1. Generating Random Complex Noise and Multiplying by Eigenvalues

We need a CUDA kernel that generates random complex numbers and multiplies them by the previously computed eigenvalues (these are sent to the device once and reused):

__global__ void fill_random_with_eigs(
    cuComplex* d_data,
    const cuComplex* d_sqrt_eigs,
    int traj_size,
    int m,
    unsigned long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= m * traj_size) return;
    int traj_id = tid / traj_size;
    int idx = tid % traj_size;
    curandState state;
    curand_init(seed + traj_id, idx, 0, &state);
    float re = curand_normal(&state);
    float im = curand_normal(&state);
    cuComplex noise = make_cuComplex(re, im);
    d_data[tid] = cuCmulf(noise, d_sqrt_eigs[idx]);
}
  1. traj_id identifies which trajectory we’re generating, allowing parallel simulation of m independent realizations.
  2. traj_size is 2 * n, since the algorithm pads the series and works in the frequency domain.
  3. We use curand_init to seed each thread, then curand_normal to generate random variates for the real and imaginary parts.

2. Performing the FFT

Next, we plan to use cuFFT on the GPU to transform our data. This step is mostly a standard usage of the cuFFT library and must be called from the host side (FFI rules):

{
    cufftHandle plan;
    cufftPlan1d(&plan, traj_size, CUFFT_C2C, m);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}

3. Scaling and Copying Data to Output

After the FFT, we apply a scale factor that accounts for the Hurst parameter and time horizon. We then copy the real part of the transformed data back to a float array:

__global__ void scale_and_copy_to_output(
    const cuComplex* d_data,
    float* d_output,
    int n,
    int m,
    int offset,
    float hurst,
    float t)
{
    int out_size = n - offset;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= m * out_size) return;
    int traj_id = tid / out_size;
    int idx = tid % out_size;
    int data_idx = traj_id * (2 * n) + (idx + 1);
    float scale = powf((float)n, -hurst) * powf(t, hurst);
    d_output[tid] = d_data[data_idx].x * scale;
}

4. The Host-Side Function

Putting it all together in a single host function callable from Rust via FFI:

extern "C" EXPORT void fgn_kernel(
    const cuComplex* d_sqrt_eigs,
    float* d_output,
    int n,
    int m,
    int offset,
    float hurst,
    float t,
    unsigned long seed)
{
    int traj_size = 2 * n;
    cuComplex* d_data = nullptr;
    cudaMalloc(&d_data, (size_t)m * traj_size * sizeof(cuComplex));
    {
        int totalThreads = m * traj_size;
        int blockSize = 512;
        int gridSize = (totalThreads + blockSize - 1) / blockSize;
        fill_random_with_eigs<<<gridSize, blockSize>>>(d_data, d_sqrt_eigs, traj_size, m, seed);
        cudaDeviceSynchronize();
    }
    {
        cufftHandle plan;
        cufftPlan1d(&plan, traj_size, CUFFT_C2C, m);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();
        cufftDestroy(plan);
    }
    {
        int out_size = n - offset;
        int totalThreads = m * out_size;
        int blockSize = 512;
        int gridSize = (totalThreads + blockSize - 1) / blockSize;
        scale_and_copy_to_output<<<gridSize, blockSize>>>(d_data, d_output, n, m, offset, hurst, t);
        cudaDeviceSynchronize();
    }
    cudaFree(d_data);
}

  1. Allocates device memory for d_data.
  2. Uses fill_random_with_eigs to populate data with random complex values and multiply by the eigenvalues.
  3. Executes cuFFT on each realization (m).
  4. Scales and copies the results into d_output.
  5. Frees GPU memory.

Integrating with Rust

Now, for the fun part: calling this C++/CUDA code from Rust. Projects like cudarc offer convenient Rust wrappers for CUDA functionalities, but not all features are always available out of the box. Below is a condensed version of how we can dynamically load the .so or .dll generated by nvcc and call the fgn_kernel function.

Note: We compile the .cu file using commands like:

# Linux
nvcc -shared -o libfgn.so fgn.cu -Xcompiler -fPIC

# Windows
nvcc -shared fgn.cu -o fgn.dll -lcufft

Rust-Side Code Snippet

fn sample_cuda(&self) -> Result<Array2<f64>, Box<dyn Error>> {
    use std::ffi::c_void;
    use cudarc::driver::{CudaDevice, DevicePtr, DevicePtrMut, DeviceRepr};
    use libloading::{Library, Symbol};

    #[repr(C)]
    #[derive(Debug, Default, Copy, Clone)]
    pub struct cuComplex {
        pub x: f32,
        pub y: f32,
    }

    unsafe impl DeviceRepr for cuComplex {
        fn as_kernel_param(&self) -> *mut c_void {
            self as *const Self as *mut _
        }
    }

    type FgnKernelFn = unsafe extern "C" fn(
        *const cuComplex,
        *mut f32,
        i32,
        i32,
        i32,
        f32,
        f32,
        u64
    );

    #[cfg(target_os = "windows")]
    let lib = unsafe { Library::new("src/stochastic/cuda/fgn_windows/fgn.dll") }?;

    #[cfg(target_os = "linux")]
    let lib = unsafe { Library::new("src/stochastic/cuda/fgn_linux/libfgn.so") }?;

    let fgn_kernel: Symbol<FgnKernelFn> = unsafe { lib.get(b"fgn_kernel") }?;
    let device = CudaDevice::new(0)?;

    let m = self.m.unwrap_or(1);
    let n = self.n;
    let offset = self.offset;
    let hurst = self.hurst;
    let t = self.t.unwrap_or(1.0);
    let seed = 42u64;

    // Prepare eigenvalues on host
    let host_sqrt_eigs: Vec<cuComplex> = self
        .sqrt_eigenvalues
        .iter()
        .map(|z| cuComplex {
            x: z.re as f32,
            y: z.im as f32,
        })
        .collect();

    // Transfer to device
    let d_sqrt_eigs = device.htod_copy(host_sqrt_eigs)?;
    let mut d_output = device.alloc_zeros::<f32>(m * (n - offset))?;

    unsafe {
        fgn_kernel(
            (*d_sqrt_eigs.device_ptr()) as *const cuComplex,
            (*d_output.device_ptr_mut()) as *mut f32,
            n as i32,
            m as i32,
            offset as i32,
            hurst as f32,
            t as f32,
            seed
        );
    }

    // Copy data back to host
    let host_output: Vec<f32> = device.sync_reclaim(d_output)?;
    let mut fgn = Array2::<f64>::zeros((m, n - offset));

    for i in 0..m {
        for j in 0..(n - offset) {
            fgn[[i, j]] = host_output[i * (n - offset) + j] as f64;
        }
    }

    Ok(fgn)
}
  1. We define our own cuComplex in Rust that matches the CUDA structure, along with an implementation of DeviceRepr.
  2. We load the shared library with libloading.
  3. We create or select a CUDA device using cudarc::driver::CudaDevice.
  4. Memory is allocated and transferred between the host and device.
  5. We call the fgn_kernel function from our .so or .dll.
  6. Once the GPU finishes, we bring the result back to Rust.

Final Thoughts

At first glance, Rust + CUDA might seem daunting, especially if you come from a C++ background where CUDA integration is mature and straightforward. However, the Rust ecosystem is growing rapidly, and projects like Rust-CUDA, cudarc, and others are making GPU programming in Rust more accessible. While there are still some rough edges—like incomplete wrappers for various CUDA libraries—this approach demonstrates that it is entirely possible to write the bulk of your application in Rust while accelerating the most computationally heavy tasks with CUDA kernels.

For fractional stochastic processes and other HPC tasks, this approach offers both safety and performance. The stochastic-rs library will continue to grow, adding more specialized fractional processes and statistical models for quantitative finance and beyond. With this GPU acceleration in place, generating large trajectories of fractional Gaussian noise has become more efficient, enabling more robust simulations and model training pipelines.

If you’re a Rust enthusiast looking to do heavy GPU work, I hope this brief guide helps illuminate the path forward. And if you’re interested in fractional processes and stochastics, feel free to check out the repository for stochastic-rs and contribute to the discussion or the code.

Happy coding, and may your FFTs be fast and your threads be plenty!