Our First Production-Ready RAG Dev Journey in Pure Rust

by Daniel Boros

Feb 6

We’ve been dabbling in AI development for some time—not just the high-level “call this existing API” style, but the kind where we’ve had to craft solutions that can’t rely on services like OpenAI or similar online systems. Recently, we were faced with the challenge of creating our first production-ready RAG (Retrieval-Augmented Generation) solution in Rust. In this post, I’ll walk through the steps we took, the bumps in the road, and where the Rust community currently stands in this area.

The Birth of Our Rust RAG

Everything started within our rust-dd explorations, where we wanted to build an abstract RAG solution. The idea was straightforward: we’d store vector embeddings in some vector database and use standard vector similarity searches to retrieve the relevant context for our local LLM.

Initial Tech Stack

  • Programming Language: Rust (of course!)
  • Vector Database: Qdrant — chosen because it’s also written in Rust and has great Rust support.
  • LLM Inference: Mistral.rs — a pure-Rust competitor to llama.cpp.

For about a month, this project ran as completely open source. We built everything we needed for a functional RAG system:

  • A ChatGPT-like frontend
  • File upload capabilities, with these files then embedded into our vector DB
  • A local LLM that could produce the best possible answers using the retrieved context

During this phase, we generated embeddings from plain text using fastembed, Qdrant's official Rust embedding library. Initial tests looked super promising, and then… the real adrenaline kicked in.
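
For the curious, that early pipeline boiled down to something like the sketch below. It assumes recent releases of the fastembed and qdrant-client crates (plus tokio and anyhow); the texts are placeholders, builder names can shift slightly between crate versions, and our real code had far more plumbing around it.

```rust
use fastembed::TextEmbedding;
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Distance, PointStruct, SearchPointsBuilder, UpsertPointsBuilder,
    VectorParamsBuilder,
};
use qdrant_client::Qdrant;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Default InitOptions picks a small embedding model and downloads it on first use.
    let model = TextEmbedding::try_new(Default::default())?;

    let chunks = vec![
        "Rust is a systems programming language.".to_string(),
        "Qdrant is a vector database written in Rust.".to_string(),
    ];
    let embeddings = model.embed(chunks.clone(), None)?;
    let dim = embeddings[0].len() as u64;

    // Qdrant's gRPC endpoint; 6334 is the default port of a local instance.
    let client = Qdrant::from_url("http://localhost:6334").build()?;

    // The collection's vector size has to match the embedding dimensionality.
    client
        .create_collection(
            CreateCollectionBuilder::new("docs")
                .vectors_config(VectorParamsBuilder::new(dim, Distance::Cosine)),
        )
        .await?;

    // Store each chunk's embedding, keeping the original text in the payload.
    let points: Vec<PointStruct> = embeddings
        .into_iter()
        .zip(chunks)
        .enumerate()
        .map(|(id, (vector, text))| {
            PointStruct::new(id as u64, vector, [("text", text.as_str().into())])
        })
        .collect();
    client
        .upsert_points(UpsertPointsBuilder::new("docs", points))
        .await?;

    // Retrieval: embed the question and run a plain similarity search.
    let mut query = model.embed(vec!["What is Qdrant?".to_string()], None)?;
    let hits = client
        .search_points(SearchPointsBuilder::new("docs", query.remove(0), 3).with_payload(true))
        .await?;
    for hit in hits.result {
        println!("score {:.3}: {:?}", hit.score, hit.payload.get("text"));
    }
    Ok(())
}
```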

The Pilot Request: Pressure Turned Up

We got a pilot request for an industrial solution with a 1.5-month deadline. The RAG had to handle unknown data sources, so we couldn’t rely on ChatGPT or other large, well-established models. Our resources were limited, meaning we could only realistically use 14–32B parameter models—mostly in 4-bit or 8-bit quantized form.

PDF Woes & Multilingual Chaos

We can’t share all the details, but let’s just say we had to answer queries from PDF files written in multiple languages, including one especially exotic language. Anyone who’s tried handling PDFs will know it’s an age-old headache:

  • The format is unstructured (hello, decades-old puzzle).
  • Converting it to text introduces further quirks.

We tested several strategies and eventually opted to:

  1. Convert PDFs to Markdown (we tried AI-based PDF converters like Marker, Docling, etc., but settled on PyMuPDF for reliability).
  2. Perform semantic chunking using a Rust text-splitting library (e.g., text-splitter or similar).
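
Step 2 in code, roughly: a minimal sketch assuming the text-splitter crate with its markdown feature enabled (we experimented with a couple of similar libraries, so treat the exact API as illustrative rather than a prescription).

```rust
use text_splitter::MarkdownSplitter;

fn chunk_markdown(markdown: &str) -> Vec<String> {
    // Upper bound on chunk size, measured in characters here; the crate can also
    // size chunks by tokens if you plug in a tokenizer.
    let splitter = MarkdownSplitter::new(1000);

    // The splitter prefers semantic boundaries (headings, paragraphs, sentences)
    // and only falls back to harder cuts when a unit is still too large.
    splitter.chunks(markdown).map(str::to_owned).collect()
}
```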

Because the documents were specialized, the smaller LLMs didn’t have enough built-in domain knowledge. In many ways, uploading the data to Qdrant was more challenging than getting the right answers out of the LLM. A lot of blog posts warn you that using PDF files in RAG solutions is a bit of a nightmare—and let’s just say we agree! If you know your data well and don’t need a universal solution, you can brute-force your way to a decent approach. But watch out for:

  • Headers and Footers: They can seriously mislead your vector search if they get embedded.
  • Chunk Size: If it’s too large and the embedding model doesn’t support big sequence lengths, some data never gets embedded properly.
  • Embedding Vector Dimensionality: If your chunk is too big for a smaller embedding dimension, you lose fidelity.

Suddenly, we had an endless puzzle of chunk sizes, embedding limits, and searching for that perfect sweet spot.
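
One small thing that made the puzzle more manageable was checking how many tokens each chunk actually produces before embedding it. Below is a sketch using the tokenizers crate (the from_pretrained call needs its http feature); the model name and the 512-token limit are illustrative stand-ins, not what we shipped.

```rust
use tokenizers::Tokenizer;

fn flag_oversized_chunks(chunks: &[String]) -> Result<(), tokenizers::Error> {
    // Load the tokenizer that matches your embedding model.
    let tokenizer = Tokenizer::from_pretrained("bert-base-uncased", None)?;
    let max_len = 512; // typical sequence limit for BERT-style embedding models

    for (i, chunk) in chunks.iter().enumerate() {
        let n_tokens = tokenizer.encode(chunk.as_str(), true)?.get_ids().len();
        if n_tokens > max_len {
            eprintln!("chunk {i}: {n_tokens} tokens, will be truncated before embedding");
        }
    }
    Ok(())
}
```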

Multi-Model Embeddings

Because we had to handle multiple languages and specialized text, we ended up using different models, all of which we grabbed from Hugging Face. We also introduced a BERT-based model and even one that produced sparse vectors. The result was a more complex embedding pipeline than we’d initially intended, but it was necessary to capture the nuance of these specialized texts.
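
In Qdrant terms, a setup like that maps onto a single collection with several named dense vectors plus a named sparse vector. Here is a sketch with a recent qdrant-client; the vector names and dimensions are made up for illustration.

```rust
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Distance, SparseVectorParamsBuilder, SparseVectorsConfigBuilder,
    VectorParamsBuilder, VectorsConfigBuilder,
};
use qdrant_client::{Qdrant, QdrantError};

async fn create_hybrid_collection(client: &Qdrant) -> Result<(), QdrantError> {
    // One named dense vector per embedding model.
    let mut dense = VectorsConfigBuilder::default();
    dense.add_named_vector_params("dense_en", VectorParamsBuilder::new(384, Distance::Cosine));
    dense.add_named_vector_params("dense_multi", VectorParamsBuilder::new(768, Distance::Cosine));

    // Plus a sparse vector for keyword-style matching.
    let mut sparse = SparseVectorsConfigBuilder::default();
    sparse.add_named_vector_params("sparse", SparseVectorParamsBuilder::default());

    client
        .create_collection(
            CreateCollectionBuilder::new("docs_hybrid")
                .vectors_config(dense)
                .sparse_vectors_config(sparse),
        )
        .await?;
    Ok(())
}
```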

Hitting Snags: Mistral.rs & CUDA

As the deadline loomed, more problems cropped up:

  • We ran into device-mapping issues in Mistral.rs (problems with how the model weights get mapped onto GPU memory).
  • We hit some CUDA-related errors too.

We aren’t expert CUDA developers and didn’t have time to fix it ourselves, but huge thanks to Eric Buehler, the maintainer of Mistral.rs, who has since addressed these issues. Unfortunately, because of our time crunch, we pivoted to Ollama-rs as a more stable framework, and Ollama ran smoothly without further GPU drama. We still intend to return to Mistral.rs eventually, because we want complete control over every component, and Mistral.rs offers plenty of low-level configuration options.
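
The swap itself was refreshingly small. A generation call with Ollama-rs looks roughly like this, assuming a locally running Ollama daemon; the model tag is just an example, not necessarily what we deployed.

```rust
use ollama_rs::generation::completion::request::GenerationRequest;
use ollama_rs::Ollama;

async fn ask_local_llm(prompt: String) -> Result<String, Box<dyn std::error::Error>> {
    // Defaults to http://localhost:11434, where the Ollama daemon listens.
    let ollama = Ollama::default();

    // Any model you have pulled with `ollama pull` works here.
    let request = GenerationRequest::new("qwen2.5:14b".to_string(), prompt);
    let response = ollama.generate(request).await?;
    Ok(response.response)
}
```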

Python Creep: Rust-Bert & PyO3

We also had to run BERT-based models in Rust via rust-bert, which uses LibTorch under the hood, and that introduced some friction around dependencies. This was the first moment we felt that, for better or worse, we simply couldn’t avoid Python for now.

So, we went for a pragmatic approach:

  • Wrote a few quick Python scripts
  • Called them from Rust using PyO3

That gave us the embeddings we needed, and we could keep sprinting toward the finish line.
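
The bridge is only a few lines on the Rust side. Here is a sketch written against a PyO3 0.20-style API (newer PyO3 releases moved to the Bound API, so the calls differ slightly); embed.py is a hypothetical stand-in for one of our helper scripts and is assumed to expose an embed(texts) function returning a list of float lists.

```rust
use pyo3::prelude::*;

fn embed_with_python(texts: Vec<String>) -> PyResult<Vec<Vec<f32>>> {
    // The helper script is compiled into the binary; it could also be loaded at runtime.
    let code = include_str!("embed.py");

    Python::with_gil(|py| {
        // Build a Python module from the script and grab its `embed` function.
        let module = PyModule::from_code(py, code, "embed.py", "embed")?;
        let embed = module.getattr("embed")?;

        // Positional arguments go in as a tuple; the result is extracted back
        // into plain Rust types.
        let result: Vec<Vec<f32>> = embed.call1((texts,))?.extract()?;
        Ok(result)
    })
}
```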

The Final Crunch

We had less than a month to go. The biggest challenge was figuring out how to feed the best vector DB results to the LLM. With huge documents and small chunks, each retrieved chunk stays precise, but you risk missing big-picture context that spans chunk boundaries. With huge chunks, you dilute the details and the similarity search becomes less relevant. We tried many “common sense” solutions that turned out not to work as well as we hoped.

In the end, we had to take a step back and really think about how the underlying math shapes our data. Often, you have to do more than just follow best practices; you need to deeply understand the model’s structure and how vector similarity is computed. But time was running short, so we decided to go with the best solution we had for the demo.
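
To make that concrete: most of this retrieval ultimately reduces to cosine similarity (or a close relative like the dot product) between embedding vectors, which is worth writing out once by hand even if the vector DB computes it for you.

```rust
/// Cosine similarity between two embeddings: the dot product divided by the
/// product of the vector norms, giving a score in [-1, 1].
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```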

The Big Day: Demo & Success

The day finally arrived to present our solution. We were beyond relieved—and a bit surprised—when everything just worked:

  • The system gave correct English answers 85–90% of the time, with virtually no hallucinations.
  • Because the entire pipeline is in English, additional languages weren’t quite as accurate, but they weren’t far behind either.

All in all, we left feeling confident we’d achieved our goal of delivering a production-ready Rust-based RAG on a tight deadline.

What’s Next?

We’re already looking forward to improving the solution’s accuracy and expanding to more languages. Those many months—and the all-nighters—definitely paid off. Our plan is to:

  • Revisit Mistral.rs for more control.
  • Tweak our chunking strategy to handle huge documents more gracefully.
  • Experiment more with diverse embedding models to support truly multilingual and domain-specific use cases.

That’s our RAG in Rust success story—the first of many, we hope!

Final Thoughts

Working in pure Rust for a RAG solution is incredibly rewarding but also full of gotchas. Whether it’s embedding pipeline complexities, GPU quirks, or having to integrate Python, building a robust RAG is a multi-layer puzzle. But if you’ve got the passion for Rust and AI, there’s no question that the community is pushing boundaries. We’re thrilled to be part of that journey—and can’t wait to see where it leads next.

If you have any questions or insights (especially if you’re wrangling Rust, GPU inference, and complex embeddings), feel free to reach out or drop a comment. Let’s keep building the future of AI—in Rust!

Need Rust Expertise?

Our team of Rust developers is ready to bring your high-performance projects to life.