From 1,600 MB/s to 6,300 MB/s: How Can We Achieve Incredible File System Performance?
In traditional FPGA-based Embedded Linux systems, NVMe storage performance typically stalls at ~1,600 MB/s.
Why?
Because most MPSoCs rely on:
- PCIe Gen3 Hard IP
- Standard Linux NVMe drivers
- Software-based protocol processing
The result? A serious performance bottleneck that prevents modern Gen4 SSDs from reaching their true potential.
But what if embedded platforms could break that limit?
Today, we demonstrate 6,300 MB/s sustained file system write performance on a Linux-based AMD Zynq UltraScale+ MPSoC platform — fully validated, reproducible, and tested with real hardware.
🎥 Watch the Full Demonstration
See the live benchmark comparison (fio vs. io_uring-perf) and full system architecture explanation here:
🔍 The Bottleneck in Conventional Systems
A conventional stack relies on:
- PCIe Gen3 Hard IP
- Standard Linux NVMe driver
- Software-heavy I/O handling
With this architecture, file-level write performance is typically capped at around 1,600 MB/s.
Even after upgrading to PCIe Gen4 SSDs, the system architecture itself remains the limiting factor.
To break this barrier, optimization must happen end-to-end — from hardware interface to application layer.
🔑 The Four Keys to 6,300 MB/s
Achieving 6.3 GB/s file-system throughput requires coordinated optimization across four layers:

1️⃣ PCIe Gen4 Soft IP Embedded in rmNVMe-IP
Standard Adaptive SoCs do not provide PCIe Gen4 Hard IP.
By integrating PCIe Gen4 Soft IP directly inside rmNVMe-IP, we unlock:
- Full Gen4 bandwidth
- ~7,000 MB/s raw capability
- Removal of the Gen3 hardware bottleneck
This enables true Gen4 SSD performance on embedded FPGA platforms.
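As a quick sanity check on that raw figure (assuming a standard Gen4 x4 link, which is our reading of the setup rather than a number stated in the demo):

$$16\ \text{GT/s} \times 4\ \text{lanes} \times \tfrac{128}{130} \div 8\ \text{bits/B} \approx 7.88\ \text{GB/s}$$

TLP headers and flow-control overhead then bring the usable payload rate down to roughly the ~7,000 MB/s quoted above.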
2️⃣ Full Hardware Offload Architecture
Traditional Linux NVMe drivers process much of the NVMe and PCIe stack in software.
Our solution:
- Offloads NVMe protocol handling to hardware
- Minimizes CPU overhead
- Reduces latency
- Improves queue efficiency
The rmNVMe-IP architecture shifts protocol processing away from the CPU and into programmable logic — where it belongs for high-throughput systems.
3️⃣ Dual-Channel High-Speed DMA (PS ↔ PL Bridge)
To match external Gen4 bandwidth internally, we designed:
- Dual 128-bit AXI interfaces
- Custom 256-bit aligned DMA engine
- 1 MB maximum I/O transfer size
- 256 hardware queue tags per queue
This allows the system to sustain more than 8,000 MB/s of internal data bandwidth, preventing internal congestion.
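That internal figure is easy to sanity-check. Taking a 250 MHz AXI clock as an illustrative assumption (the actual clock is design-specific), two 128-bit channels move:

$$2 \times 128\ \text{bit} \times 250\ \text{MHz} = 2 \times 16\ \text{B} \times 250\ \text{MHz} = 8{,}000\ \text{MB/s}$$

comfortably above the ~7,000 MB/s the Gen4 link itself can deliver, so the PS ↔ PL bridge never becomes the choke point.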
4️⃣ High-Performance Application Using io_uring
Even optimized hardware can be limited by inefficient applications.
Standard benchmark tools like fio reached:
- ~4,000 MB/s (filesystem write)
But our custom application built on io_uring achieved:
👉 6,300 MB/s sustained file-system write throughput
Why?
Because io_uring enables:
- Asynchronous submission/completion rings
- Batch I/O
- Zero-copy data paths
- Reduced system calls
- Lower context switching
The result: maximum throughput with minimal CPU overhead.
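To make this concrete, here is a minimal sketch of the kind of submit/refill loop a tool like io_uring-perf can use. To be clear, this is not the tool itself: the target path, queue depth, and test size are illustrative assumptions, and error handling is trimmed. The mechanics behind the numbers are visible, though: O_DIRECT to bypass the page cache, 1 MB aligned buffers matching the DMA engine, a deep queue primed in one batched submission, and each completion immediately refilling the ring so the SSD never goes idle.

```c
/*
 * Minimal io_uring sequential-write loop (illustrative sketch only).
 * Requires Linux 5.6+ and liburing.  Build: gcc -O2 uring_write.c -luring
 * WARNING: point the path at a scratch file or disposable device.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 32                 /* in-flight writes (assumption) */
#define BLOCK_SIZE  (1UL << 20)        /* 1 MB, matching the DMA's max transfer */
#define TOTAL_BYTES (4ULL << 30)       /* 4 GiB test size (assumption) */

int main(int argc, char **argv)
{
    /* Hypothetical target; pass your own scratch file or device. */
    const char *path = argc > 1 ? argv[1] : "./uring_test.bin";

    /* O_DIRECT bypasses the page cache so data flows straight to the device. */
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    /* One buffer per in-flight request.  4 KiB alignment satisfies both
     * O_DIRECT and the 256-bit (32-byte) DMA alignment described above. */
    void *bufs[QUEUE_DEPTH];
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (posix_memalign(&bufs[i], 4096, BLOCK_SIZE)) return 1;
        memset(bufs[i], 0xA5, BLOCK_SIZE);
    }

    unsigned long long inflight = 0, offset = 0;

    /* Prime the ring: queue a full batch of writes, then submit them
     * all with a single system call. */
    for (int i = 0; i < QUEUE_DEPTH && offset < TOTAL_BYTES; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, bufs[i], BLOCK_SIZE, offset);
        io_uring_sqe_set_data(sqe, bufs[i]);
        offset += BLOCK_SIZE;
        inflight++;
    }
    io_uring_submit(&ring);

    /* Completion loop: every finished write immediately refills the
     * queue, so the device always has QUEUE_DEPTH requests in flight. */
    while (inflight > 0) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        if (cqe->res < 0)
            fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));

        void *buf = io_uring_cqe_get_data(cqe);   /* reuse this buffer */
        io_uring_cqe_seen(&ring, cqe);
        inflight--;

        if (offset < TOTAL_BYTES) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_write(sqe, fd, buf, BLOCK_SIZE, offset);
            io_uring_sqe_set_data(sqe, buf);
            offset += BLOCK_SIZE;
            inflight++;
            io_uring_submit(&ring);
        }
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

A production tool can go further: io_uring_register_buffers() and registered files remove per-I/O mapping costs, and SQ polling (IORING_SETUP_SQPOLL) can eliminate submission syscalls entirely.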
📊 Real Performance Results (Validated on ZCU106 + Intel P5800X)
Raw Device (Gen4)
- Sequential Write (fio): ~4,183 MB/s
- Sequential Write (io_uring-perf): ~6,303 MB/s
- Sequential Read: ~5,359 MB/s
Filesystem (ext4)
- Sequential Write (fio): ~4,126 MB/s
- Sequential Write (io_uring-perf): ~6,165 MB/s
- Mixed R/W: ~2,613 MB/s sustained
These are not simulation results. They are measured, reproducible, and fully documented.
🌍 Why This Matters for Data Storage & Data Centers
Breaking the 1,600 MB/s barrier transforms embedded platforms into serious data infrastructure nodes.
✔ AI Video Analytics
Sustain multi-channel 4K/8K recording without frame loss.

✔ 5G / MEC Infrastructure
High-speed packet logging and edge CDN caching.

✔ Industrial Data Acquisition
Continuous high-frequency sensor capture.

✔ Edge Databases
Stable, predictable high-throughput storage without I/O bottlenecks.

Embedded Linux is no longer limited to “lightweight” workloads.
📥 Ready for Evaluation?
Our complete demo system (PetaLinux-based) is available for download and validation on your own hardware.
📩 Contact Us
Interested in breaking your NVMe bottleneck?
👉 Contact Our Engineering Team
📂 Free Evaluation Files
Download the demo package and test it in your own lab environment:
👉 rmNVMe-IP on PetaLinux (AMD) | rmNVMe-IP Gen4 (Altera)
🔎 Learn More
Product Page
🔗 rmNVMe-IP (AMD) | rmNVMe-IP (Altera)
Technical Documents
🔗 rmNVMe-IP on PetaLinux (AMD): Instruction
🔗 rmNVMe-IP Gen4 (Altera): Datasheet | Reference Design
🔥 Final Takeaway
Achieving 6,300 MB/s file-system performance on Embedded Linux is not about one trick.
It requires:
✔ Gen4-ready SSD interface
✔ Full hardware NVMe offload
✔ High-speed dual-channel DMA
✔ Application-level optimization with io_uring
When every layer is optimized, embedded platforms can deliver data-center-class storage performance.
If your edge system is still capped at 1.6 GB/s, it’s time to redesign the architecture.
Let’s unlock Gen4 performance together 🚀

