Inline Assembly in Rust

Rust is known for its safety and speed, making it great for building reliable software. However, sometimes you need even more control over how your code interacts with the hardware. This is where inline assembly comes in.

Inline assembly lets you write low-level CPU instructions directly in your Rust code. This is useful for:

Maximizing Performance: Fine-tune your code to run faster by optimizing critical sections.
Hardware Access: Interact directly with hardware components (e.g., device drivers, operating systems).
Special Features: Use specific processor capabilities that high-level Rust code might not fully support.

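By using inline assembly, you can keep Rust’s safety and concurrency features in most of your program and apply the precise control of assembly language where it matters. The assembly itself lives in small unsafe blocks, which keeps the low-level code contained and makes this approach well suited to advanced system-level programming.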

A Short Overview of Inline Assembly

When you write inline assembly in Rust, you embed assembly instructions directly into your Rust code. The asm! macro (from std::arch::asm) allows you to specify:

The instructions themselves (e.g., mov, add, sub, mul, popcnt, etc.).
Input operands (the Rust variables that get passed to assembly).
Output operands (the Rust variables that receive the results).
Options to control how the assembly is inserted by the compiler (e.g., preserves_flags, nostack, etc.).

Here is a short reference for some basic instructions (x86_64 assembly in Intel syntax, which asm! uses by default on x86 targets, so the destination operand comes first):

mov dest, src: Copies the value in src to dest.
add dest, src: Adds src to dest and stores the result in dest.
sub dest, src: Subtracts src from dest and stores the result in dest.
imul dest, src: Multiplies dest by src (signed multiply) and stores the low bits of the product in dest.
popcnt dest, src: Counts the number of set bits (1s) in src and stores the result in dest.

Example: Simple Assembly

use std::arch::asm;

fn example_add(x: u32, y: u32) -> u32 {
    let sum: u32;
    unsafe {
        asm!(
            "mov {0:e}, {1:e}",  // sum = x (":e" selects the 32-bit register name, e.g. eax)
            "add {0:e}, {2:e}",  // sum += y
            out(reg) sum,
            in(reg) x,
            in(reg) y,
        );
    }
    sum
}

fn main() {
    let result = example_add(10, 20);
    println!("10 + 20 = {}", result);
}

Above, we directly move x into sum and then add y. This is a straightforward example of how to embed assembly instructions in Rust.
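A variation on this example, offered as a sketch rather than the canonical form: an inout operand lets one register serve as both the first input and the result, so the mov is no longer needed. The name add_inout and the portable fallback for non-x86_64 targets are assumptions made for illustration. The :e modifier selects the 32-bit register name (for example eax), which matches the u32 operands.

```rust
/// Adds two u32 values with a single `add` instruction (x86_64 only).
#[cfg(target_arch = "x86_64")]
pub fn add_inout(x: u32, y: u32) -> u32 {
    use std::arch::asm;

    let mut acc = x;
    unsafe {
        asm!(
            "add {0:e}, {1:e}", // acc += y, wrapping on overflow
            inout(reg) acc,     // read as input, written as output
            in(reg) y,
            // Promises: no side effects, no memory access, no stack use.
            options(pure, nomem, nostack)
        );
    }
    acc
}

/// Hypothetical fallback so the sketch still builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_inout(x: u32, y: u32) -> u32 {
    x.wrapping_add(y)
}
```

Calling add_inout(10, 20) yields 30, and because the add instruction wraps, add_inout(u32::MAX, 1) yields 0.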

How to Use Inline Assembly

Below is a more detailed example to square a number:

pub fn square(x: u32) -> u32 {
    let result: u32;
    unsafe {
        asm!(
            "mov {temp:e}, {input:e}",  // Move x into temp (":e" selects the 32-bit register name)
            "imul {temp:e}, {input:e}", // Multiply temp by x
            input = in(reg) x,
            temp = out(reg) result,
        );
    }
    result
}

unsafe {}: Inline assembly must be wrapped in an unsafe block because the compiler cannot verify what the instructions do; you are responsible for upholding Rust’s guarantees.
asm! macro: Contains the assembly code, along with inputs/outputs.
in(reg) x: Passes the Rust variable x into a register for use in the assembly.
out(reg) result: Reserves a register for the result, which is then stored in result after the assembly finishes.

You can string together as many assembly instructions as you like, but keep in mind:

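Register constraints matter: reg lets the compiler pick any general-purpose register, while instructions with fixed operands need explicit registers such as "rax". Every register the assembly modifies must be declared as an output or a clobber.
options(...) makes promises to the compiler that it does not check. For example, preserves_flags states that the assembly leaves the CPU flags unchanged, and nostack states that it does not push to the stack. A false promise is undefined behavior.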
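To make both points concrete, here is a minimal sketch of a widening multiply; the function name wide_mul and the non-x86_64 fallback are illustrative assumptions. The one-operand mul instruction always reads rax and writes rdx:rax, so those operands have to be pinned to explicit registers rather than left to reg.

```rust
/// Multiplies two u64 values into a full 128-bit product (x86_64 only).
#[cfg(target_arch = "x86_64")]
pub fn wide_mul(a: u64, b: u64) -> u128 {
    use std::arch::asm;

    let lo: u64;
    let hi: u64;
    unsafe {
        asm!(
            "mul {0}",                // rdx:rax = rax * {0}
            in(reg) a,                // any general-purpose register
            inlateout("rax") b => lo, // rax holds b on entry, the low half on exit
            lateout("rdx") hi,        // rdx receives the high half
            // No memory access and no stack use. The flags ARE modified,
            // so preserves_flags is deliberately not set.
            options(pure, nomem, nostack)
        );
    }
    ((hi as u128) << 64) | (lo as u128)
}

/// Hypothetical fallback so the sketch still builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn wide_mul(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}
```

Using lateout tells the compiler that these registers are written only after all inputs have been read, which allows it to reuse an input register for an output.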

Scenarios and Use Cases

1. Interacting with Special Hardware Features

In some systems-level tasks, you might need to interact with unique hardware features or registers that standard Rust (or crates) do not expose. This is common when you’re:

Writing device drivers for specialized peripherals.
Working on low-level system code for an operating system or firmware.
Manipulating control registers for advanced CPU or chipset features.

Example: Reading the Time Stamp Counter (TSC)

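The Time Stamp Counter (TSC) is a high-resolution 64-bit counter that increments on every clock tick; on most modern x86 processors it runs at a constant rate regardless of the core’s current frequency. It’s valuable for performance measurements or benchmarking. You can read it using inline assembly like this: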

use std::arch::asm;

pub fn read_tsc() -> u64 {
    let tsc: u64;
    unsafe {
        asm!(
            "rdtsc",                // Read the TSC: low 32 bits in EAX, high 32 bits in EDX
            "shl rdx, 32",          // Move the high half into bits 32..63
            "or rax, rdx",          // Combine both halves in RAX
            out("rax") tsc,         // Output full TSC value
            out("rdx") _,           // RDX is overwritten, so declare it as clobbered
        );
    }
    tsc
}

fn main() {
    let tsc = read_tsc();
    println!("CPU Time-Stamp Counter: {}", tsc);
}

As you can see, the inline assembly reads the TSC into RAX and RDX, then combines them into a single 64-bit value.

Note: Rust provides intrinsics like _rdtsc on x86/x86_64 that do the same job without manual assembly, but they are still unsafe. For example:

use std::arch::x86_64::_rdtsc;

fn main() {
    unsafe {
        let tsc = _rdtsc();
        println!("Time Stamp Counter (TSC): {}", tsc);
    }
}
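As a hedged illustration of the benchmarking use mentioned earlier, the sketch below wraps a closure between two counter reads. The helper names read_ticks and measure_ticks are hypothetical, and the non-x86_64 branch substitutes elapsed nanoseconds from std::time::Instant so that the sketch builds elsewhere. Treat the tick count as a rough figure only: rdtsc is not a serializing instruction, so the CPU may reorder it relative to the surrounding work.

```rust
/// Reads the Time Stamp Counter through the std::arch intrinsic.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // harmless if the toolchain declares the intrinsic safe
fn read_ticks() -> u64 {
    unsafe { std::arch::x86_64::_rdtsc() }
}

/// Assumed stand-in for other architectures: nanoseconds since first use.
#[cfg(not(target_arch = "x86_64"))]
fn read_ticks() -> u64 {
    use std::sync::OnceLock;
    use std::time::Instant;

    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

/// Runs `work` once and returns its result with the elapsed tick count.
pub fn measure_ticks<T>(work: impl FnOnce() -> T) -> (T, u64) {
    let start = read_ticks();
    let value = work();
    let end = read_ticks();
    // wrapping_sub avoids a panic if the counter ever appears to go backwards.
    (value, end.wrapping_sub(start))
}
```

A call such as measure_ticks(|| (1..=10u64).sum::<u64>()) returns the closure's result together with the number of ticks it took. For serious measurements a harness such as Criterion is the safer choice.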

2. Maximizing Performance

Inline assembly can help you squeeze out every bit of performance in hot code paths. Sometimes, using specific CPU instructions can reduce the instruction count or lower latency.

Example: Fast Bit Counting with `POPCNT`

Here we compare a pure Rust approach to an inline assembly version. We use Criterion, a Rust library for microbenchmarking that provides accurate measurements and statistical analysis (like confidence intervals). This allows us to see the actual performance difference under repeated test runs.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::arch::is_x86_feature_detected;

fn count_set_bits_rust(mut value: u32) -> u32 {
    let mut count = 0;
    while value != 0 {
        count += value & 1;
        value >>= 1;
    }
    count
}

#[cfg(target_arch = "x86_64")]
fn count_set_bits_asm(value: u32) -> u32 {
    use std::arch::asm;

    let count: u32;
    unsafe {
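        asm!(
            // ":e" selects the 32-bit register names; the upper half of a 64-bit
            // register holding a u32 is undefined and must not be counted.
            "popcnt {0:e}, {1:e}",
            out(reg) count,
            in(reg) value,
            // popcnt modifies the flags, so preserves_flags must not be claimed.
            options(pure, nomem, nostack)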
        );
    }
    count
}

#[cfg(not(target_arch = "x86_64"))]
fn count_set_bits_asm(_value: u32) -> u32 {
    unimplemented!("Assembly implementation is only available on x86_64 architectures.")
}

fn bench_count_set_bits(c: &mut Criterion) {
    let test_value: u32 = 0b1011_0010_1111_0001_0110_1010_1101_0111;

    if is_x86_feature_detected!("popcnt") {
        c.bench_function("Count Set Bits - Pure Rust", |b| {
            b.iter(|| {
                let count = count_set_bits_rust(black_box(test_value));
                black_box(count);
            })
        });

        c.bench_function("Count Set Bits - Inline ASM", |b| {
            b.iter(|| {
                let count = count_set_bits_asm(black_box(test_value));
                black_box(count);
            })
        });
    } else {
        println!("POPCNT instruction not supported on this CPU.");
    }
}

criterion_group!(benches, bench_count_set_bits);
criterion_main!(benches);

Criterion’s repeated measurements help reduce noise, making it easier to compare two approaches fairly. The code above counts the bits set to 1 in a 32-bit integer. This operation is common in cryptographic, compression, and other performance-sensitive contexts, where bit manipulations happen frequently.

Benchmark Results

Below is a simplified example of the output from Criterion. As you can see, the inline assembly version appears faster in these synthetic tests:

Count Set Bits - Pure Rust
	time:   [15.593 ns 16.827 ns 18.066 ns]

Count Set Bits - Inline ASM
	time:   [386.76 ps 404.79 ps 418.91 ps]

Because Criterion runs many iterations and applies statistical analysis, we get a clear view of the typical execution time and how stable the measurements are. The inline assembly version is faster here, but note what it is compared against: a bit-by-bit loop. The idiomatic pure Rust approach is u32::count_ones, which the compiler can lower to POPCNT when that target feature is enabled at build time. In practice the difference might therefore be negligible; the example shows how a single instruction can outperform a naive high-level loop in tight loops or niche scenarios.
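For comparison, here is a minimal sketch of the idiomatic pure Rust route: the standard library’s u32::count_ones, which the compiler can turn into a single popcnt when that feature is enabled for the build (for example with -C target-cpu=native). The function names are illustrative; the loop is the same bit-by-bit approach used in the benchmark.

```rust
/// Bit-by-bit loop, as in the benchmark's pure Rust version.
pub fn count_set_bits_loop(mut value: u32) -> u32 {
    let mut count = 0;
    while value != 0 {
        count += value & 1; // add the lowest bit
        value >>= 1;        // move on to the next bit
    }
    count
}

/// Standard-library population count: safe, portable, no assembly required.
pub fn count_set_bits_builtin(value: u32) -> u32 {
    value.count_ones()
}
```

Both functions return the same result for every input, so adding count_set_bits_builtin as a third case in the benchmark would make the comparison more realistic.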

Summary

Inline assembly in Rust is a powerful feature for specialized situations:

Special Hardware Access: If you need to directly manipulate device registers or use unique CPU/chipset capabilities, inline assembly offers the granularity needed. This is especially true in kernels, drivers, and embedded systems, where you might deal with memory-mapped I/O or custom registers.
Performance-Critical Sections: If there is a specific CPU instruction that can replace multiple lines of Rust, inline assembly might help you cut down overhead. Examples include instructions like rdtsc, popcnt, or advanced vector extensions.

However, keep these points in mind:

Unsafe: Inline assembly is not checked by Rust’s safety guarantees. Mistakes can lead to crashes or undefined behavior.
Architecture-Specific: The code you write for x86_64 will not directly work on ARM or other architectures, which can reduce portability.
Compiler Intrinsics: Rust’s std::arch intrinsics and standard-library methods such as u32::count_ones cover many CPU instructions without hand-written assembly, although some intrinsics still require unsafe. In many cases, you get the same performance.
Maintenance & Complexity: Inline assembly can be harder to read, debug, and maintain.
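One way to contain the first two caveats, sketched under assumptions: keep the unsafe block behind a safe function, check for the instruction at run time, and fall back to portable code otherwise. The names popcount and popcount_asm are hypothetical, and this is not the only reasonable layout.

```rust
/// Tries the `popcnt` instruction; returns None when the CPU lacks it.
#[cfg(target_arch = "x86_64")]
fn popcount_asm(value: u32) -> Option<u32> {
    if !std::arch::is_x86_feature_detected!("popcnt") {
        return None;
    }
    let count: u32;
    // SAFETY: the runtime check above confirms the CPU supports popcnt.
    unsafe {
        std::arch::asm!(
            "popcnt {0:e}, {1:e}", // 32-bit register names for u32 operands
            out(reg) count,
            in(reg) value,
            options(pure, nomem, nostack)
        );
    }
    Some(count)
}

/// On other architectures there is no assembly path at all.
#[cfg(not(target_arch = "x86_64"))]
fn popcount_asm(_value: u32) -> Option<u32> {
    None
}

/// Safe, portable entry point: assembly when available, std otherwise.
pub fn popcount(value: u32) -> u32 {
    popcount_asm(value).unwrap_or_else(|| value.count_ones())
}
```

Callers only see the safe popcount function, so the architecture-specific code and its safety argument stay in one small place.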

In most scenarios, Rust and its compiler optimizations are powerful enough to generate efficient machine code. But if you need that last drop of performance or direct hardware access, inline assembly gives you the tools to go lower-level and achieve high performance with precise control.