From 1,600 MB/s to 6,300 MB/s: How Did We Achieve This File System Performance?

In traditional FPGA-based Embedded Linux systems, NVMe storage performance typically stalls at ~1,600 MB/s.

Why?

Because most MPSoCs rely on:

  • PCIe Gen3 Hard IP
  • Standard Linux NVMe drivers
  • Software-based protocol processing

The result? A serious performance bottleneck that prevents modern Gen4 SSDs from reaching their true potential.

But what if embedded platforms could break that limit?

Today, we demonstrate 6,300 MB/s sustained file system write performance on a Linux-based AMD Zynq UltraScale+ MPSoC platform — fully validated, reproducible, and tested with real hardware.


🎥 Watch the Full Demonstration

See the live benchmark comparison (fio vs. io_uring-perf) and full system architecture explanation here:

👉 YouTube Demo


🔍 The Bottleneck in Conventional Systems

With the conventional stack:

  • PCIe Gen3 Hard IP
  • Standard Linux NVMe driver
  • Software-heavy I/O handling

File-level write performance is typically capped at around 1,600 MB/s.

Even when switching to PCIe Gen4 SSDs, the system architecture becomes the limiting factor.

To break this barrier, optimization must happen end-to-end — from hardware interface to application layer.


🔑 The Four Keys to 6,300 MB/s

Achieving 6.3 GB/s file-system throughput requires coordinated optimization across four layers:

Four Keys to Achieving 6,300 MB/s on Embedded Linux: PCIe Gen4 Soft IP, rmNVMe-IP hardware offload engine, dual-channel DMA architecture, and an optimized io_uring application

1️⃣ PCIe Gen4 Soft IP Embedded in rmNVMe-IP

Standard Adaptive SoCs do not provide PCIe Gen4 Hard IP.

By integrating PCIe Gen4 Soft IP directly inside rmNVMe-IP, we unlock:

  • Full Gen4 bandwidth
  • ~7,000 MB/s raw capability
  • Removal of Gen3 hardware bottleneck

This enables true Gen4 SSD performance on embedded FPGA platforms.
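As a sanity check on the ~7,000 MB/s figure, the theoretical line rate of a PCIe Gen4 link can be estimated from the per-lane transfer rate (16 GT/s) and the 128b/130b line encoding. The x4 link width below is an assumption (the post does not state it), and the quoted ~7,000 MB/s is the achievable payload rate after TLP and protocol overhead, somewhat below this theoretical ceiling:

```python
# Estimate raw PCIe Gen4 bandwidth. The x4 lane count is an assumption
# (typical for NVMe SSDs); the post does not state the link width.
GT_PER_S = 16          # Gen4 transfer rate per lane, in GT/s
LANES = 4              # assumed x4 link
ENCODING = 128 / 130   # 128b/130b line-encoding efficiency

raw_gbytes = GT_PER_S * LANES * ENCODING / 8  # GB/s
print(f"Theoretical Gen4 x4 line rate: {raw_gbytes * 1000:.0f} MB/s")
```

TLP headers, flow control, and completion traffic consume part of this, which is why the usable raw-device ceiling lands near 7,000 MB/s.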


2️⃣ Full Hardware Offload Architecture

Traditional Linux NVMe drivers process much of the NVMe and PCIe stack in software.

Our solution:

  • Offloads NVMe protocol handling to hardware
  • Minimizes CPU overhead
  • Reduces latency
  • Improves queue efficiency

The rmNVMe-IP architecture shifts protocol processing away from the CPU and into programmable logic — where it belongs for high-throughput systems.


3️⃣ Dual-Channel High-Speed DMA (PS ↔ PL Bridge)

To match external Gen4 bandwidth internally, we designed:

  • Dual 128-bit AXI interfaces
  • Custom 256-bit aligned DMA engine
  • 1 MB maximum I/O transfer size
  • 256 hardware queue tags per queue

This allows the system to sustain over 8,000 MB/s internal data bandwidth, preventing internal congestion.
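A quick back-of-the-envelope check of the >8,000 MB/s internal figure, assuming the dual 128-bit AXI interfaces run at 250 MHz (the clock frequency is an assumption; the post does not state it):

```python
# Aggregate PS<->PL DMA bandwidth from the interface parameters above.
CHANNELS = 2            # dual AXI channels
AXI_WIDTH_BITS = 128    # per-channel data width
CLOCK_MHZ = 250         # assumed AXI clock; not stated in the post

bytes_per_cycle = CHANNELS * AXI_WIDTH_BITS // 8
mb_per_s = bytes_per_cycle * CLOCK_MHZ  # MB/s (10^6 bytes per second)
print(f"Aggregate internal bandwidth: {mb_per_s} MB/s")
```

At the assumed clock this yields 8,000 MB/s, comfortably above the ~6,300 MB/s delivered to the SSD, so the internal fabric never becomes the bottleneck.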


4️⃣ High-Performance Application Using io_uring

Even optimized hardware can be limited by inefficient applications.

Standard benchmark tools like fio reached:

  • ~4,000 MB/s (filesystem write)

But our custom application built on io_uring achieved:

👉 6,300 MB/s sustained file-system write throughput

Why?

Because io_uring enables:

  • Asynchronous submission/completion rings
  • Batch I/O
  • Zero-copy data paths
  • Reduced system calls
  • Lower context switching

The result: maximum throughput with minimal CPU overhead.
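io_uring itself is driven from C via liburing; as a minimal illustration of the batched-submission principle (many buffers, one system call), here is a sketch using POSIX vectored I/O from Python. This is not io_uring — io_uring goes further with shared submission/completion rings and fully asynchronous completion — but it shows why collapsing per-buffer syscalls into one submission cuts overhead:

```python
import os
import tempfile

# Submit many buffers with ONE system call (os.pwritev) instead of one
# syscall per buffer. io_uring extends this idea with shared rings and
# asynchronous completion; this sketch only illustrates batching.
# Requires Linux (os.pwritev is not available on all platforms).
BLOCK = 64 * 1024
buffers = [bytes([i]) * BLOCK for i in range(8)]  # 8 x 64 KiB payloads

fd, path = tempfile.mkstemp()
try:
    written = os.pwritev(fd, buffers, 0)  # one syscall, eight buffers
    print(f"Wrote {written} bytes in a single submission")
finally:
    os.close(fd)
    os.unlink(path)
```

In the real application, the same principle is applied with io_uring submission queues, so thousands of 1 MB writes are in flight with only a handful of syscalls.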


📊 Real Performance Results (Validated on ZCU106 + Intel P5800X)

Raw Device (Gen4)

  • Sequential Write (fio): ~4,183 MB/s
  • Sequential Write (io_uring-perf): ~6,303 MB/s
  • Sequential Read: ~5,359 MB/s

Filesystem (ext4)

  • Sequential Write (fio): ~4,126 MB/s
  • Sequential Write (io_uring-perf): ~6,165 MB/s
  • Mixed R/W: ~2,613 MB/s sustained

These are not simulation results. They are measured, reproducible, and fully documented.
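From the ext4 numbers above, the gain of the io_uring-based application over fio is roughly 1.5x:

```python
# Speedup of io_uring-perf over fio, using the ext4 sequential-write
# results reported above.
fio_mb_s = 4126     # fio, ext4 sequential write
uring_mb_s = 6165   # io_uring-perf, ext4 sequential write

print(f"Speedup: {uring_mb_s / fio_mb_s:.2f}x")
```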


🌍 Why This Matters for Data Storage & Data Centers

Breaking the 1,600 MB/s barrier transforms embedded platforms into serious data infrastructure nodes.

✔ AI Video Analytics

Sustain multi-channel 4K/8K recording without frame loss.

AI Video Analytics on FPGA + Linux: multi-channel 4K/8K capture, real-time processing, and sustained SSD recording

✔ 5G / MEC Infrastructure

High-speed packet logging and edge CDN caching.

5G / MEC Infrastructure: edge servers, CDN caching, and high-speed packet logging

✔ Industrial Data Acquisition

Continuous high-frequency sensor capture.

Industrial Data Acquisition: continuous high-frequency sensor capture streamed to SSD storage

✔ Edge Databases

Stable, predictable high-throughput storage without I/O bottlenecks.

Edge Databases: distributed nodes with stable, predictable NVMe throughput

Embedded Linux is no longer limited to “lightweight” workloads.


📥 Ready for Evaluation?

Our complete demo system (PetaLinux-based) is available for download and validation on your own hardware.


📩 Contact Us

Interested in breaking your NVMe bottleneck?

👉 Contact Our Engineering Team


📂 Free Evaluation Files

Download the demo package and test it in your own lab environment:

👉 rmNVMe-IP on PetaLinux (AMD) | rmNVMe-IP Gen4 (Altera)


🔎 Learn More

Product Page

🔗 rmNVMe-IP (AMD) | rmNVMe-IP (Altera)


Technical Documents

🔗 rmNVMe-IP on PetaLinux (AMD): Instruction
🔗 rmNVMe-IP Gen4 (Altera): Datasheet | Reference Design


🤝 Official Partner Platforms

Available via:


🔥 Final Takeaway

Achieving 6,300 MB/s file-system performance on Embedded Linux is not about one trick.

It requires:

✔ Gen4-ready SSD interface
✔ Full hardware NVMe offload
✔ High-speed dual-channel DMA
✔ Application-level optimization with io_uring

When every layer is optimized, embedded platforms can deliver data-center-class storage performance.


If your edge system is still capped at 1.6 GB/s, it’s time to redesign the architecture.

Let’s unlock Gen4 performance together 🚀
