CUDA Programming in Rust: Extending Fractional Noise Generation for AI Models
by Daniel Boros
Mar 8
7 min read
76 views
In this post, I will share my journey of implementing CUDA support in Rust for generating large quantities of synthetic data for AI model training, particularly focusing on fractional stochastic processes. This work is part of the stochastic-rs crate, a Rust library that provides a wide range of stochastic processes—currently nearly 60—for simulation, statistical modeling, and quantitative finance applications. The library supports both classical and fractional stochastic differential equations (SDEs), allowing for more advanced modeling by replacing the usual Wiener process with a fractional Wiener process.
Below, I’ll go through the motivation behind adding CUDA to accelerate the generation of fractional Gaussian noise (fGn), some key steps of the implementation, and how we integrate the resulting GPU-accelerated code into Rust.
Why Fractional Processes?
A fractional stochastic process emerges when we replace the standard Wiener process in an SDE with a fractional Wiener process (fBm). The resulting process can exhibit long-range dependence and other behaviors not captured by standard Brownian motion. In finance, for example, these models can be useful when the usual assumption of uncorrelated increments does not hold, and in physics or other sciences, fractional processes can describe more complex phenomena.
The stochastic-rs crate provides around 60 different processes, many of which also have fractional variations. The core enabler for fractional processes here is generating high-quality fractional Gaussian noise (fGn) efficiently. On the CPU side, the Davies–Harte algorithm (a Fast Fourier Transform-based approach) is used. While this works well for moderate problem sizes, GPU acceleration can help a lot with larger-scale simulations, especially when repeatedly performing FFTs and large matrix multiplications.
Why CUDA in Rust?
While Rust is an excellent language for systems programming, it still lags behind C++ in terms of official, comprehensive CUDA support. Although there is a Rust-CUDA project under development, it does not yet cover everything needed—particularly cuFFT
, cuComplex
, and curand_kernel
for random number generation in complex space. Therefore, the approach chosen was to implement the essential CUDA kernels in .cu
files and invoke them from Rust via FFI (Foreign Function Interface).
The primary goal was to keep the entire library in Rust, yet still leverage the GPU for the most computation-intensive parts of fractional Gaussian noise generation. Below is a simplified overview of how the Davies–Harte method was offloaded to the GPU.
Davies–Harte on the GPU: Key Steps
Recall: The Davies–Harte algorithm uses an autocovariance vector to build an embedded circular matrix, then performs an FFT to obtain the eigenvalues. From there, we sample complex Gaussian random variables, multiply them by those eigenvalues, transform back with an inverse FFT, and apply a scaling factor. This yields the fractional Gaussian noise time series.
1. Generating Random Complex Noise and Multiplying by Eigenvalues
We need a CUDA kernel that generates random complex numbers and multiplies them by the previously computed eigenvalues (these are sent to the device once and reused):
__global__ void fill_random_with_eigs(
cuComplex* d_data,
const cuComplex* d_sqrt_eigs,
int traj_size,
int m,
unsigned long seed)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= m * traj_size) return;
int traj_id = tid / traj_size;
int idx = tid % traj_size;
curandState state;
curand_init(seed + traj_id, idx, 0, &state);
float re = curand_normal(&state);
float im = curand_normal(&state);
cuComplex noise = make_cuComplex(re, im);
d_data[tid] = cuCmulf(noise, d_sqrt_eigs[idx]);
}
traj_id
identifies which trajectory we’re generating, allowing parallel simulation ofm
independent realizations.traj_size
is2 * n
, since the algorithm pads the series and works in the frequency domain.- We use
curand_init
to seed each thread, thencurand_normal
to generate random variates for the real and imaginary parts.
2. Performing the FFT
Next, we plan to use cuFFT on the GPU to transform our data. This step is mostly a standard usage of the cuFFT library and must be called from the host side (FFI rules):
{
cufftHandle plan;
cufftPlan1d(&plan, traj_size, CUFFT_C2C, m);
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
cudaDeviceSynchronize();
cufftDestroy(plan);
}
3. Scaling and Copying Data to Output
After the FFT, we apply a scale factor that accounts for the Hurst parameter and time horizon. We then copy the real part of the transformed data back to a float array:
__global__ void scale_and_copy_to_output(
const cuComplex* d_data,
float* d_output,
int n,
int m,
int offset,
float hurst,
float t)
{
int out_size = n - offset;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= m * out_size) return;
int traj_id = tid / out_size;
int idx = tid % out_size;
int data_idx = traj_id * (2 * n) + (idx + 1);
float scale = powf((float)n, -hurst) * powf(t, hurst);
d_output[tid] = d_data[data_idx].x * scale;
}
4. The Host-Side Function
Putting it all together in a single host function callable from Rust via FFI:
extern "C" EXPORT void fgn_kernel(
const cuComplex* d_sqrt_eigs,
float* d_output,
int n,
int m,
int offset,
float hurst,
float t,
unsigned long seed)
{
int traj_size = 2 * n;
cuComplex* d_data = nullptr;
cudaMalloc(&d_data, (size_t)m * traj_size * sizeof(cuComplex));
{
int totalThreads = m * traj_size;
int blockSize = 512;
int gridSize = (totalThreads + blockSize - 1) / blockSize;
fill_random_with_eigs<<<gridSize, blockSize>>>(d_data, d_sqrt_eigs, traj_size, m, seed);
cudaDeviceSynchronize();
}
{
cufftHandle plan;
cufftPlan1d(&plan, traj_size, CUFFT_C2C, m);
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
cudaDeviceSynchronize();
cufftDestroy(plan);
}
{
int out_size = n - offset;
int totalThreads = m * out_size;
int blockSize = 512;
int gridSize = (totalThreads + blockSize - 1) / blockSize;
scale_and_copy_to_output<<<gridSize, blockSize>>>(d_data, d_output, n, m, offset, hurst, t);
cudaDeviceSynchronize();
}
cudaFree(d_data);
}
- Allocates device memory for
d_data
. - Uses
fill_random_with_eigs
to populate data with random complex values and multiply by the eigenvalues. - Executes cuFFT on each realization (
m
). - Scales and copies the results into
d_output
. - Frees GPU memory.
Integrating with Rust
Now, for the fun part: calling this C++/CUDA code from Rust. Projects like cudarc offer convenient Rust wrappers for CUDA functionalities, but not all features are always available out of the box. Below is a condensed version of how we can dynamically load the .so
or .dll
generated by nvcc
and call the fgn_kernel
function.
Note: We compile the
.cu
file using commands like:# Linux nvcc -shared -o libfgn.so fgn.cu -Xcompiler -fPIC # Windows nvcc -shared fgn.cu -o fgn.dll -lcufft
Rust-Side Code Snippet
fn sample_cuda(&self) -> Result<Array2<f64>, Box<dyn Error>> {
use std::ffi::c_void;
use cudarc::driver::{CudaDevice, DevicePtr, DevicePtrMut, DeviceRepr};
use libloading::{Library, Symbol};
#[repr(C)]
#[derive(Debug, Default, Copy, Clone)]
pub struct cuComplex {
pub x: f32,
pub y: f32,
}
unsafe impl DeviceRepr for cuComplex {
fn as_kernel_param(&self) -> *mut c_void {
self as *const Self as *mut _
}
}
type FgnKernelFn = unsafe extern "C" fn(
*const cuComplex,
*mut f32,
i32,
i32,
i32,
f32,
f32,
u64
);
#[cfg(target_os = "windows")]
let lib = unsafe { Library::new("src/stochastic/cuda/fgn_windows/fgn.dll") }?;
#[cfg(target_os = "linux")]
let lib = unsafe { Library::new("src/stochastic/cuda/fgn_linux/libfgn.so") }?;
let fgn_kernel: Symbol<FgnKernelFn> = unsafe { lib.get(b"fgn_kernel") }?;
let device = CudaDevice::new(0)?;
let m = self.m.unwrap_or(1);
let n = self.n;
let offset = self.offset;
let hurst = self.hurst;
let t = self.t.unwrap_or(1.0);
let seed = 42u64;
// Prepare eigenvalues on host
let host_sqrt_eigs: Vec<cuComplex> = self
.sqrt_eigenvalues
.iter()
.map(|z| cuComplex {
x: z.re as f32,
y: z.im as f32,
})
.collect();
// Transfer to device
let d_sqrt_eigs = device.htod_copy(host_sqrt_eigs)?;
let mut d_output = device.alloc_zeros::<f32>(m * (n - offset))?;
unsafe {
fgn_kernel(
(*d_sqrt_eigs.device_ptr()) as *const cuComplex,
(*d_output.device_ptr_mut()) as *mut f32,
n as i32,
m as i32,
offset as i32,
hurst as f32,
t as f32,
seed
);
}
// Copy data back to host
let host_output: Vec<f32> = device.sync_reclaim(d_output)?;
let mut fgn = Array2::<f64>::zeros((m, n - offset));
for i in 0..m {
for j in 0..(n - offset) {
fgn[[i, j]] = host_output[i * (n - offset) + j] as f64;
}
}
Ok(fgn)
}
- We define our own
cuComplex
in Rust that matches the CUDA structure, along with an implementation ofDeviceRepr
. - We load the shared library with
libloading
. - We create or select a CUDA device using
cudarc::driver::CudaDevice
. - Memory is allocated and transferred between the host and device.
- We call the
fgn_kernel
function from our.so
or.dll
. - Once the GPU finishes, we bring the result back to Rust.
Final Thoughts
At first glance, Rust + CUDA might seem daunting, especially if you come from a C++ background where CUDA integration is mature and straightforward. However, the Rust ecosystem is growing rapidly, and projects like Rust-CUDA, cudarc, and others are making GPU programming in Rust more accessible. While there are still some rough edges—like incomplete wrappers for various CUDA libraries—this approach demonstrates that it is entirely possible to write the bulk of your application in Rust while accelerating the most computationally heavy tasks with CUDA kernels.
For fractional stochastic processes and other HPC tasks, this approach offers both safety and performance. The stochastic-rs library will continue to grow, adding more specialized fractional processes and statistical models for quantitative finance and beyond. With this GPU acceleration in place, generating large trajectories of fractional Gaussian noise has become more efficient, enabling more robust simulations and model training pipelines.
If you’re a Rust enthusiast looking to do heavy GPU work, I hope this brief guide helps illuminate the path forward. And if you’re interested in fractional processes and stochastics, feel free to check out the repository for stochastic-rs and contribute to the discussion or the code.
Happy coding, and may your FFTs be fast and your threads be plenty!