Building a Hardware-Accelerated ZK Prover Core in Rust
TL;DR — Zero-knowledge proofs are bottlenecked by two operations — Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT) — that together consume ~90% of proving time. We are building a ZK prover core in Rust from the field-arithmetic layer up, offloading the compute-critical path to GPU via Ingonyama's ICICLE library. This post explains why that is necessary, how the ZK ecosystem is evolving to demand it, and exactly how we plan to build it.
The Proving Time Problem
Zero-knowledge proofs, first described in the 1980s, have become one of the most consequential cryptographic tools of the last decade. They allow one party to prove knowledge of a secret without revealing it, and they are becoming the backbone of scalable blockchains, private smart contracts, and trustless cross-chain bridges. zkSync, Scroll, Polygon zkEVM, Aztec, and StarkNet are all betting billions of dollars on this technology.
But there is a practical problem that does not appear in any whitepaper: generating a ZK proof is extraordinarily slow. On a modern CPU, proving a simple smart contract execution takes seconds to minutes. Proving a full Ethereum block takes hours. The cryptography is correct. The engineering is not yet good enough.
Here is a rough breakdown of proving time, as commonly cited for pairing-based SNARK provers such as Groth16 and PLONK:
Multi-Scalar Multiplication (MSM): ~60%
Number Theoretic Transform (NTT): ~30%
Everything else: ~10%
The exact split varies with the proof system, the circuit, and the implementation. Hash-based STARK provers have no MSM step at all, so their time goes mostly to NTT and hashing. In the SNARK case, MSM and NTT together dominate the prover's runtime, accounting for roughly 90% of total proving time.
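For readers who have not met the two operations before, the sketch below shows what each one computes, in its slowest textbook form. It is an illustration under loud assumptions, not prover code: the names `naive_msm`, `naive_ntt`, and `naive_intt` are hypothetical, the MSM uses integers modulo a small prime as a stand-in for elliptic-curve points, and the NTT runs over the toy field F_17 rather than BN254.

```rust
/// Toy "group": integers under addition modulo GROUP_MOD.
/// Assumption: this stands in for elliptic-curve points, where every
/// scalar multiplication is far more expensive than one integer multiply.
const GROUP_MOD: u64 = 101;

/// Toy NTT field F_17. Assumption: chosen only because 4 is a primitive
/// 4th root of unity in it, which keeps the example small.
const P: u64 = 17;

/// Naive MSM: sum_i scalars[i] * points[i] in the toy group.
pub fn naive_msm(scalars: &[u64], points: &[u64]) -> u64 {
    assert_eq!(scalars.len(), points.len());
    scalars
        .iter()
        .zip(points)
        .fold(0u64, |acc, (&s, &p)| (acc + (s % GROUP_MOD) * (p % GROUP_MOD)) % GROUP_MOD)
}

/// Square-and-multiply modular exponentiation.
fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut acc = 1u64;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

/// Naive O(n^2) NTT: out[k] = sum_j coeffs[j] * omega^(j*k) mod P.
pub fn naive_ntt(coeffs: &[u64], omega: u64) -> Vec<u64> {
    let n = coeffs.len() as u64;
    (0..n)
        .map(|k| {
            coeffs.iter().enumerate().fold(0u64, |acc, (j, &c)| {
                (acc + (c % P) * pow_mod(omega, j as u64 * k, P)) % P
            })
        })
        .collect()
}

/// Inverse NTT: the same sum with omega^-1, then scaled by n^-1.
/// Both inverses come from Fermat's little theorem (x^(P-2) mod P).
pub fn naive_intt(evals: &[u64], omega: u64) -> Vec<u64> {
    let n_inv = pow_mod(evals.len() as u64, P - 2, P);
    let omega_inv = pow_mod(omega, P - 2, P);
    naive_ntt(evals, omega_inv)
        .into_iter()
        .map(|v| v * n_inv % P)
        .collect()
}
```

Calling `naive_ntt(&[1, 2, 3, 4], 4)` evaluates the polynomial 1 + 2x + 3x^2 + 4x^3 at the four powers of 4 in F_17, and `naive_intt` recovers the coefficients. The quadratic loops are the point of the sketch: they make it easy to see why these two kernels are the ones worth moving to a GPU.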
Why the ZK Ecosystem Demands Hardware Acceleration Now
The ZK space has shifted from "does this work?" to "can we run this in production?" in roughly two years. Three forces are driving demand for hardware-accelerated provers simultaneously.
Real-time proving is becoming the bar
zkSync Era and Scroll both target sub-minute proof generation for user transactions. Aztec's Noir compiler generates circuits that must be proved within a single block time, around 12 seconds. These are production SLAs, not research targets. Meeting them on CPU-only hardware today requires either very small circuits or very powerful machines. GPU acceleration is the only path to making them economically viable at scale.
Recursion multiplies the proving load
Modern ZK systems use recursive proof composition: a proof that verifies other proofs. Nova, HyperNova, and Plonky2 all do this. Recursion is powerful: it enables incrementally verifiable computation and proof aggregation. But each level of recursion adds another MSM and NTT pass. A system that was marginally fast at depth 1 becomes unusable at depth 5. Hardware acceleration does not just make things faster; it makes recursion architecturally viable.
The hardware ecosystem just caught up
Ingonyama released ICICLE in 2023, a CUDA library that exposes GPU-accelerated MSM and NTT with a clean C API and growing Rust bindings. Nvidia H100 GPUs are now available on every major cloud provider. The tooling, the hardware, and the Rust FFI ecosystem have converged. The missing piece is engineers who understand all three layers: the cryptographic algorithm, the Rust memory model, and the GPU kernel behavior. That intersection is exactly what this project targets.
Why Rust — Not C++, Not Python
GPU-accelerated cryptography is almost always written in C++. CUDA is a C++ extension. Most ZK prover backends have C or C++ somewhere in the critical path. So why Rust?
Memory safety without a garbage collector
GPU integration requires manually managing two distinct memory spaces: the host (CPU RAM) and the device (GPU VRAM). In C++, you allocate device memory with cudaMalloc and free it with cudaFree. Forget the free, and you leak VRAM until the process exits. Call free twice, and you corrupt the CUDA context.
{% callout type="warning" %}
These bugs are not hypothetical: they appear regularly in production ZK proving infrastructure.
{% /callout %}
In Rust, you model device memory as a struct with a Drop implementation that calls cudaFree:
```rust
use std::marker::PhantomData;

pub struct DeviceBuffer<T> {
    ptr: *mut T,             // raw pointer into device memory
    len: usize,              // number of T elements in the allocation
    _marker: PhantomData<T>, // records that this struct logically owns a T
}
```
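The struct on its own does not show the Drop side of the pattern, which is what makes the free run exactly once. A minimal sketch follows, assuming host memory as a stand-in for VRAM so that it runs without a GPU: `mock_device_alloc`, `mock_device_free`, and the `FREE_CALLS` counter are hypothetical names introduced here, and a real binding would call cudaMalloc and cudaFree over FFI instead.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::marker::PhantomData;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts frees so the "exactly once" property can be observed.
/// Illustration only; not something a real prover would need.
pub static FREE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for cudaMalloc, backed by host memory.
unsafe fn mock_device_alloc(layout: Layout) -> *mut u8 {
    unsafe { alloc(layout) }
}

/// Hypothetical stand-in for cudaFree.
unsafe fn mock_device_free(ptr: *mut u8, layout: Layout) {
    FREE_CALLS.fetch_add(1, Ordering::SeqCst);
    unsafe { dealloc(ptr, layout) }
}

pub struct DeviceBuffer<T> {
    ptr: *mut T,
    len: usize,
    _marker: PhantomData<T>,
}

impl<T> DeviceBuffer<T> {
    /// Allocates room for `len` elements. The memory is left uninitialized,
    /// as a fresh device allocation would be.
    pub fn new(len: usize) -> Self {
        let layout = Layout::array::<T>(len).expect("buffer size overflows");
        assert!(layout.size() > 0, "zero-sized buffers are not supported in this sketch");
        // SAFETY: the layout has non-zero size, as checked above.
        let ptr = unsafe { mock_device_alloc(layout) } as *mut T;
        assert!(!ptr.is_null(), "mock device allocation failed");
        DeviceBuffer { ptr, len, _marker: PhantomData }
    }

    pub fn len(&self) -> usize {
        self.len
    }
}

impl<T> Drop for DeviceBuffer<T> {
    fn drop(&mut self) {
        let layout = Layout::array::<T>(self.len).expect("layout was valid in new");
        // SAFETY: ptr came from mock_device_alloc with this same layout,
        // and Drop is the only place that frees it.
        unsafe { mock_device_free(self.ptr.cast::<u8>(), layout) };
    }
}
```

Because `DeviceBuffer` is neither Copy nor Clone, writing `let moved = buf;` transfers ownership instead of duplicating the pointer, so only one value can ever reach `drop`. Forgetting the free and freeing twice both become things the compiler rules out rather than things a reviewer has to catch.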
System Architecture
The prover core is organized into three layers that correspond directly to the execution model: cryptographic primitives at the base, Rust CPU orchestration in the middle, and GPU kernels at the top.
Layer 1 - Cryptographic Primitives
The foundation is finite field arithmetic over the BN254 scalar field. A FieldElement is a stack-allocated struct implementing Add, Sub, Mul, Neg. Because it is Copy, expressions like a * b + c do not require cloning or borrowing gymnastics. The Polynomial type is a newtype wrapper over Vec, a recurring Rust idiom in ZK code that uses the type system to prevent meaningless operations while still benefiting from Vec's memory management.
Layer 2 - CPU Host: Rust Orchestration + SIMD
An async Tokio task graph orchestrates the proving pipeline: CPU-side work proceeds while the GPU is computing, which keeps idle time low on both. For operations too small to justify a GPU round-trip, Rust's SIMD intrinsics (std::arch::x86_64) use AVX2 instructions. An AVX2 register is 256 bits wide, so one instruction operates on four 64-bit or eight 32-bit lanes at a time, which lets several limbs of field elements be processed in parallel.
Layer 3 - GPU Device: ICICLE Kernels
The GPU layer runs the MSM and NTT kernels via ICICLE. The interface from Rust to ICICLE is a thin unsafe FFI layer wrapped in a safe DeviceBuffer abstraction.
{% callout type="info" %}
Why PhantomData matters
A raw pointer on its own says nothing about ownership. PhantomData is a zero-size marker that tells the compiler "this struct logically owns a T", which is the information that variance, auto-trait inference (Send/Sync), and drop check are built on. Note that the raw pointer still makes DeviceBuffer neither Send nor Sync by default; those have to be opted into with an explicit unsafe impl. Expect to explain this in a ZK engineering interview.
{% /callout %}
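The Layer 1 types can be sketched in a few dozen lines. This is a minimal illustration, not the project's implementation: it assumes a single-limb toy modulus (the Mersenne prime 2^61 - 1) in place of BN254, whose scalars would need four 64-bit limbs, and it assumes Polynomial wraps a Vec of FieldElement coefficients. The `new` and `evaluate` methods are hypothetical names added for the example.

```rust
use std::ops::{Add, Mul, Neg, Sub};

/// Toy modulus 2^61 - 1. Assumption: stands in for the BN254 scalar field
/// so that one u64 holds a whole element.
const MODULUS: u64 = (1 << 61) - 1;

/// Stack-allocated and Copy, so operands can be reused freely.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FieldElement(u64);

impl FieldElement {
    pub fn new(value: u64) -> Self {
        FieldElement(value % MODULUS)
    }
}

impl Add for FieldElement {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Both operands are below 2^61, so the sum cannot overflow u64.
        FieldElement((self.0 + rhs.0) % MODULUS)
    }
}

impl Sub for FieldElement {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self {
        // Add MODULUS first so the subtraction never underflows.
        FieldElement((self.0 + MODULUS - rhs.0) % MODULUS)
    }
}

impl Mul for FieldElement {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // Widen to u128 so the product does not overflow.
        FieldElement(((self.0 as u128 * rhs.0 as u128) % MODULUS as u128) as u64)
    }
}

impl Neg for FieldElement {
    type Output = Self;
    fn neg(self) -> Self {
        FieldElement((MODULUS - self.0) % MODULUS)
    }
}

/// Newtype over a coefficient vector, lowest degree first. The wrapper
/// exposes only polynomial operations, not the full Vec API.
pub struct Polynomial(Vec<FieldElement>);

impl Polynomial {
    pub fn new(coeffs: Vec<FieldElement>) -> Self {
        Polynomial(coeffs)
    }

    /// Horner evaluation. `x` is used on every iteration without cloning
    /// because FieldElement is Copy.
    pub fn evaluate(&self, x: FieldElement) -> FieldElement {
        self.0
            .iter()
            .rev()
            .fold(FieldElement::new(0), |acc, &c| acc * x + c)
    }
}
```

With these impls, `a * b + c` reads like the math it encodes, and `a`, `b`, and `c` remain usable afterwards. A production field type would replace the `%` reductions with Montgomery arithmetic, which the series returns to in Part 6.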
What's Next in This Series
Part 2 - Stack vs. heap in a ZK prover: FieldElement, HeapBuffer<T>, and why Copy matters for field arithmetic
Part 3 - Traits as cryptographic contracts: implementing a generic Field trait and writing NTT<F: Field>
Part 4 - Unsafe Rust and GPU memory: DeviceBuffer<T>, PhantomData, Send + Sync, and why cudaFree must run exactly once
Part 5 - Parallelizing NTT with Rayon: work-stealing, par_chunks_mut, and measuring real speedup with criterion
Part 6 - Montgomery multiplication: the algorithm, the implementation, and why it is the atom of everything above it
Part 7 - Pippenger MSM from scratch: bucket accumulation, running-sum reduction, connecting CPU implementation to ICICLE
Part 8 - The full prover: end-to-end commit, open, verify, with benchmarks, flamegraphs, and lessons learned
Every concept in this series is grounded in code you can run, break, and fix. The borrow checker error you see when you try to use a moved DeviceBuffer teaches you more about GPU memory safety than any whitepaper.
Build the system. Read the errors. Write the comments explaining each one. That is the preparation that actually transfers to an interview room.