DE10-Nano: FPGA to HPS data transfer


Introduction


To further explore the DE10-Nano development board I decided to start a new project. I would like to create one that streams data from the FPGA to the HPS.

These are my requirements defining the main functions of the project:

  • Transfer data from FPGA embedded RAM to HPS SDRAM
  • Transfer using the FPGA-to-HPS bridge (F2H bridge)
  • Transfer using the FPGA-to-SDRAM interface (F2S interface)
  • Find the fastest transfer throughput
  • For data transfer complete, use an interrupt triggered from the FPGA
  • Support sending parameters from the HPS to the FPGA
  • Use a hardware timer to measure transfer throughput
  • Verify the transferred data

Gathering information


Documents with useful information

After reading through many technical manuals and studying various projects, building my project became feasible. Let me review my findings with you.

Interconnecting data bus

On the Cyclone V SoC FPGA there are four types of data buses where data can cross the two fabric:

Data bus Number of ports Master data width (in bits) Slave data width (in bits) Protocol
FPGA-to-HPS bridge (F2H bridge) 1 32, 64, or 128 64 AXI-3
FPGA-to-SDRAM interface (F2S interface) 1 to 3 (AXI-3) or
1 to 6 (Avalon-MM)
32, 64, 128, or 256 (total limited to 256) 64, 2x64, or 4x64 AXI-3 or Avalon-MM
HPS-to-FPGA bridge (H2F bridge) 1 32, 64, or 128 64 AXI-3
Lightweight HPS-to-FPGA bridge (L2F bridge) 1 32 32 AXI-3

Since we want to transfer from the FPGA to the HPS, we can use the F2H bridge and the F2S interface to write data into the HPS SDRAM. I will use the AXI-3 protocol because it is available for all data buses.

Cache maintenance

When L1 or L2 data cache are enabled we need to manage them on data transfers with the HPS SDRAM.

Data transfer over the F2H bridge and ACP

SDRAM access through the F2H bridge is cache coherent but only when used with the ACP memory map. With this method we don't need to apply any cache maintenance.

The ACP memory map is provided by the ACP ID Mapper (A HPS component) which manages a 1GB cache coherency window and also maps 12-bit AXI IDs for it to work.

The Accelerator Coherency Port (ACP) (An ARM MPCore component) integrated in the Snoop Control Unit (SCU) provides a slave port to the ACP ID Mapper. The ACP ID Mapper receives the read & write accesses from the F2H bridge and generates corresponding AXI requests to the ACP, which works with the SCU to make the actual cache coherent access with the SDRAM.

The 1 GB cache coherent window (SDRAM memory map) facing the ACP is remappable within the 4 GB Arm Cortex-A9 MPCore address space. By default it is mapped to the bottom of the SDRAM, i.e. address range: 0x00000000 to 0x3FFFFFFF.

The 1 GB cache coherent window (ACP memory map) facing the F2H bridge however is fixed, and is located at the 1 GB address range: 0x80000000 to 0xBFFFFFFF. To use the 1 GB cache coherent window we simply write into the ACP memory map range with the F2H bridge using the correct AXI-3 AxUSER and AxCACHE settings. For the details see "Accelerator Coherency Port" in the "Cyclone V Hard Processor System Technical Reference Manual" and the "Arm Cortex-A9 MPCore Technical Reference Manual".

Data transfer over the F2S interface

In comparison, the F2S interface provides a much wider data bus than the F2H bridge, but the requests are made directly to the SDRAM controller and so there is no builtin cache coherency support. The HPS must do one of the following in order to see updated data transfers in SDRAM:

  1. issue L1/L2 data cache flush/clean before a data transfer or
  2. set an area of SDRAM to be noncacheable via the Memory Management Unit (MMU) or
  3. disable L1/L2 data cache
FPGA embedded memory

FPGA data acquisition needs memory buffer capable of working at the acquisition clock frequency. The DE10-Nano does not have RAM connected to the FPGA side, but the FPGA itself has some embedded memory (internal memory).

The Cyclone V SoC FPGA model 5CSEBA6U23I7 has 553 blocks of M10K embedded memory bits, which gives total memory of: 553 * 10K = 553 * 10240 = 5662720 bits = 707840 bytes. I will be using slightly less than that because we need to align with the largest transfer size.

I will use the AXI-3 protocol to transfer data to the HPS. It is capable of performing 1-16 bursts of write transactions. We will get the highest throughput if we use the largest burst length of 16 and the largest data bus width. This table shows the largest transfer sizes for each data bus:

Data bus Transfer size per burst Transfer size of 16 bursts
F2H bridge 128 bits = 16 bytes 16 * 128 bits = 2048 bits = 256 bytes
F2S interface 256 bits = 32 bytes 16 * 256 bits = 4096 bits = 512 bytes
H2F bridge 128 bits = 16 bytes 16 * 128 bits = 2048 bits = 256 bytes
L2F bridge 32 bits = 4 bytes 16 * 32 bits = 512 bits = 16 bytes
Using the FPGA embedded memory

FPGA embedded memory can be utilised using a suitable Intel IP, available within the Platform Designer, under "Basic Functions/On Chip Memory".

Platform Designer FPGA IP OCR
Platform Designer FPGA IP OCR

Alternatively, embedded memory can be utilised in Verilog using a special synthesis attribute or a synthesis comment. For details see Quartus® Prime Pro Edition Help: ramstyle Verilog HDL Synthesis Attribute. My m10k_mem.v Verilog module implements it this way.

The attribute tells the Analysis & Synthesis to implement the register array (inferred as RAM) using embedded memory instead of logic. Note it may fail to implement using embedded memory when limitations are not met, such as, the read and write must be clocked, the register must be an array, the register must be not be larger than the available embedded memory.

Vendor IP limitations

Typically to save time, we would use the vendor's IP (Intel IP) over implementing our own Verilog logic. For data transfers these Intel IP categories can be used:

  • "Basic Functions/On Chip Memory"
  • "Basic Functions/DMA"
  • "Processors and Peripherals/Peripherals/PIO (Parallel I/O) Intel FPGA IP"

Since I want to obtain high throughput I would like to use 2048 and 4096-bit data widths, unfortunately none of the "On Chip Memory" Intel IPs support such large sizes. I will need to write my own Verilog logic for it.

There is a very nice project created by Roberto Fernández Molanes many years ago, who has designed the hardware mainly with the Intel IPs in Platform Designer CycloneVSoC-time-measurements. It will be interesting to compare the results.

Project design


To test the two data buses, it was easier to implement separate projects, one for the F2H bridge and another for the F2S interface.

FPGA hardware design

The design is implemented using Verilog and minimum Intel IPs. There is a main top.v file with the main control logic and the rest are separate module files.

This block diagram shows the flow and connections between the logic from the FPGA to the HPS for the F2H bridge + ACP design:

F2H BRAM to SDRAM transfer design diagram
F2H BRAM to SDRAM transfer design diagram

This block diagram shows the flow and connections between the logic from the FPGA to the HPS for the F2S interface design:

F2S BRAM to SDRAM transfer design diagram
F2S BRAM to SDRAM transfer design diagram

Some components are ommitted for simplification.

In the FPGA (orange coloured side) we have the following component logic:

BRAM

This represents my custom Verilog module implementing a block RAM using FPGA embedded memory. This is the memory buffer that stores the test data to be transferred to the HPS SDRAM. The block RAM is organised linearly with fixed size blocks starting from address 0.

The block size (data width) is configured to align with 16 bursts of write AXI-3 transactions with the largest data bus width as follows:

  • 128-bit F2H bridge: 128-bit * 16 bursts = 2048 bits = 256 bytes
  • 256-bit F2S bridge: 256-bit * 16 bursts = 4096 bits = 512 bytes
Parallel IO IP (PIO IP) + L2F bridge for parameters

The Intel PIO IP provides an easy way to use the L2F bridge for exchanging low bandwidth data. My FPGA design will wait for the length parameter to be non-zero and the ready flag to signal before it will actually start transferring. This IP provides memory mapped registers for the HPS to send parameters to the FPGA without us needing to work with the AXI-3 (L2F bridge) directly.

Parallel IO IP (PIO IP) + L2F bridge for ready flags

For synchronization between the FPGA and HPS I use ready flags as signals. The flags enable the HPS to indicate the FPGA when the next transfer can start and when to de-assert the interrupt trigger.

AXI-3 write helper

This is my custom Verilog module which implements a simple AXI-3 master write only controller. It will write data from a register over AXI-3. It is used here for transferring blocks of data from the BRAM to the HPS over the F2H bridge or F2S interface. Essentially achieves the same result as an Intel DMA IP.

HPS software design

The software on the HPS has a master role and the hardware design on the FPGA has a slave role. The FPGA waits for a signal before starting the transfer process. The HPS controls the transfers, when a transfer completes the FPGA triggers an interrupt, and the HPS handler executes, which measures the throughput, reads and verifies the data in the SDRAM buffer. It is interrupt and preemptive task (FreeRTOS) driven.

Parameter transfer

Parameters and flags are transmitted between the two sides over the Lightweight HPS-to-FPGA bridge (L2F bridge), accessed with the Parallel IO IP (PIO IP), which provides memory-mapped registers for the HPS.

Ready flag and synchronization

Two separate ready flags are used for acknowledgement between the two sides. They are organised as a 2 position cyclic buffer, where after a complete data transfer each side (FPGA and HPS) switches to an opposite index. This ensures that each side will always use a different ready flag, which solves the synchronisation problem of a flag being read and written at the same time.

F2S interface cache coherency

Since the F2S interface does not support cache coherency access, I have decided to set the SDRAM buffer region to non-cacheable using MMU table entries.

The process flow
  1. The compiled hardware design file is downloaded to the FPGA
  2. The HPS application is executed
  3. The FPGA fills the BRAM (M10K embedded memory) with test data
  4. The FPGA is now in a waiting state; waiting for the HPS to send parameters and the ready signal before it starts transferring
  5. The HPS application send parameters to the FPGA, starts a preemptive control task, starts a timer and asserts the ready signal
  6. The FPGA transfers data from FPGA BRAM to HPS SDRAM buffer defined by start address and size parameters received from HPS earlier
  7. When the transfer size is reached the FPGA asserts an interrupt trigger to generate one on the HPS, and then enters into a waiting state
  8. The IRQ is raised on the HPS which jumps into the interrupt handler. The handler stops the timer, deasserts the ready signal to indicate the FPGA to deassert the interrupt trigger, and then signals the preemptive control task to process the data
  9. The FPGA switches to another ready flag
  10. The control task calculates the throughput of the transfer and verifies the transferred data
  11. If the test is not yet complete, the task switches to another ready flag, and asserts the ready signal for the FPGA to start another transfer, repeating the process
Some technical details:
  • FreeRTOS provides the preemptive control task and synchronization
  • The FPGA triggers IRQ 72 (FPGA IRQ 0) interrupt
  • Arm MPCore Global Timer for measuring the data transfer. Alternatively, we can use one of the four HPS peripheral timer, but in comparison those are 32-bit counter timers vs the Arm MPCore a 64-bit counter timer.

Project files


My project is located here: fpga_to_hps.

Results


The results are surprising, I get extremely fast throughput!

F2H bridge + ACP results

Test configuration:
FPGA: 128-bit F2H bridge at 200MHz
HPS: L1+L2 caching enabled in the software
All data verification passed. The results are displayed in this chart.

F2H + ACP results chart
F2H + ACP results chart

Let's compare with the CycloneVSoC-time-measurements project. We can compare with the read transfer results from his Excel file.

The highest result from my project:
F2H bridge, 128-bit, 200MHz, 128MB: 775.01 MB/s
F2H bridge, 128-bit, 200MHz, 64KB: 706.50 MB/s

The highest result from the CycloneVSoC-time-measurements project:
F2H bridge, 128-bit, 50MHz, 64KB: 54.21 MB/s
F2H bridge, 128-bit, 100MHz, 64KB: 71.53 MB/s
F2H bridge, 128-bit, 150MHz, 64KB: 81.79 MB/s

F2S interface results

Test configuration:
FPGA: 256-bit F2S interface at 200MHz
HPS: L1+L2 caching enabled in the software and the buffer region set to non-cacheable
All data verification passed. The results are displayed in this chart.

F2S ACP results chart
F2S results chart

The highest result from my project:
F2S interface, 256-bit, 200MHz, 256MB: 1782.62 MB/s
F2S interface, 256-bit, 200MHz, 2MB: 1781.89 MB/s

The highest result from the CycloneVSoC-time-measurements project:
2x F2S interface, 2x128-bit, 50MHz, 2MB: 597.22 MB/s
2x F2S interface, 2x128-bit, 100MHz, 2MB: 984.28 MB/s
2x F2S interface, 2x128-bit, 150MHz, 2MB: 1452.90 MB/s


Document date: Rev 1: 15 Oct 2024