DE10-Nano: Multiprocessing application (AMP system)


Introduction


The DE10-Nano board contains a dual-core ARM Cortex-A9 processor, which we can utilise with multiprocessing-capable programs. So far we have only been running uniprocessing programs on core 0 (CPU 0). It would be nice if we could execute a second program on core 1 (CPU 1).

The two well-known systems that make use of multiple processors are Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP). We are going to look at the AMP system, where we run an independent application on each core.

Boot flow


Let's have a look at the boot flow which uses U-Boot as the bootloader (note, for simplicity some steps may be omitted or simplified):

Single core boot flow
  1. On powerup or reset, only core 0 is enabled, and core 1 is held in a reset state (paused)
  2. The HPS remaps the boot ROM (located at addresses 0xFFFD0000 - 0xFFFEBFFF) to memory address 0x0 and releases core 0 from reset
  3. Core 0 starts executing instructions from address 0x0, which means the boot ROM program executes
    • The boot ROM program reads the BSEL switches to determine the boot source, and reads sector 0 (512 bytes) from it
    • If the MBR partition signature (0x55AA) exists, it switches into MBR partition mode, else it switches into raw mode
    • In partition mode, the boot ROM searches for an A2 partition and, if one exists, loads the preloader program from the start of that partition and jumps to it
  4. The preloader program is U-Boot SPL in this example, and it is now executing. U-Boot SPL configures the HPS, remaps the lower memory address 0x0 (previously the boot ROM program) to DDR-3 SDRAM, then executes main U-Boot
  5. U-Boot then loads and executes the desired user application, e.g. standalone application or an OS such as Linux
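
The sector 0 check in step 3 can be sketched in C. This is only an illustration of the standard MBR layout (signature bytes at offset 510, four 16-byte partition entries from offset 446, partition type in byte 4 and starting LBA in bytes 8 to 11 of each entry), not the boot ROM's actual code. The function name find_a2_partition() is made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

#define SECTOR_SIZE        512u
#define MBR_SIG_OFFSET     510u   /* signature bytes 0x55, 0xAA */
#define MBR_PART_OFFSET    446u   /* first of four partition entries */
#define MBR_PART_ENTRY_LEN 16u
#define PART_TYPE_A2       0xA2u  /* partition type holding the preloader */

/* Hypothetical helper: returns 1 and stores the first sector (LBA) of the
 * A2 partition in *start_lba when sector 0 is an MBR containing one.
 * Returns 0 when the signature is missing (the raw mode case) or when no
 * entry has type 0xA2. */
static int find_a2_partition(const uint8_t sector0[SECTOR_SIZE], uint32_t *start_lba)
{
    if (sector0[MBR_SIG_OFFSET] != 0x55u || sector0[MBR_SIG_OFFSET + 1u] != 0xAAu)
        return 0;

    for (size_t i = 0; i < 4; i++) {
        const uint8_t *e = &sector0[MBR_PART_OFFSET + i * MBR_PART_ENTRY_LEN];
        if (e[4] == PART_TYPE_A2) {
            /* The starting LBA is stored little-endian */
            *start_lba = (uint32_t)e[8] | ((uint32_t)e[9] << 8) |
                         ((uint32_t)e[10] << 16) | ((uint32_t)e[11] << 24);
            return 1;
        }
    }
    return 0;
}
```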

For multi-core booting, we need to change the boot flow a little bit:

Multi-core boot flow

In this instance, we want to run two applications concurrently, app1 to run on core 0, and app2 to run on core 1. We load both applications into memory, start app1, and have app1 release core 1 from the reset state by writing a register. When core 1 is released from reset, core 1 execution starts from address 0x0, which means the app2 we loaded earlier will execute.
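
As a rough sketch of the "release core 1" step, the function below clears the CPU1 bit in the reset manager's MPU module reset register. The register address (0xFFD05010) and bit position (bit 1) are my assumptions from the Cyclone V HPS register map, so check them against the Technical Reference Manual or the HWLib headers before use. The function name release_core1() is made up, and it takes the register as a pointer so it can also be tried on a host PC with an ordinary variable.

```c
#include <stdint.h>

/* Assumed Cyclone V HPS reset manager register: verify against the TRM */
#define RSTMGR_MPUMODRST_ADDR  0xFFD05010u
#define RSTMGR_MPUMODRST_CPU1  (1u << 1)   /* 1 = core 1 held in reset */

/* Hypothetical helper: clears the CPU1 reset bit, leaving other bits as is */
static void release_core1(volatile uint32_t *mpumodrst)
{
    *mpumodrst &= ~RSTMGR_MPUMODRST_CPU1;
}

/* On the target, app1 would call it after app2 is in place at address 0x0:
 *     release_core1((volatile uint32_t *)RSTMGR_MPUMODRST_ADDR);
 */
```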

Data cache coherency


In a multi-processing system, we usually want to share data or messages between the applications running on the processors (aka inter-process communication (IPC)). This requires shared data to be coherent, i.e. changes to shared data must be propagated so that both cores see the same data.

This diagram (from the Cyclone V HPS Technical Reference Manual) illustrates the different components within the cores:

Block diagram of the Cortex-A9 MPU subsystem

The diagram shows that each core has its own MMU and L1 cache (instruction and data caches). These two components, together with the SCU and the L2 cache (if enabled), affect the data coherency between the two cores.

L1 data cache (L1 d-cache) coherency

To keep data coherent between the two cores we can apply one of these options:

  1. On both cores, set up the MMU table with the memory regions (entries) concerned, assign attributes that enable the L1 cache with memory_type set to NORMAL and shared_type set to SHARED, then enable SMP coherency support (SMP participation and MMU cache broadcasting) and enable the Snoop Control Unit (SCU)
  2. On both cores, set up the MMU table with the memory regions (entries) concerned and assign attributes that disable the L1 cache: inner_cachability_type set to NON-CACHEABLE, or memory_type set to DEVICE or STRONGLY ORDERED
  3. Disable the L1 data cache
  4. Clean the L1 data cache every time we change the shared data

Note, the SMP coherency support can be applied in an AMP system - but naturally you are required to have the MMU table set with the same memory regions and attributes on both cores.
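
To make the MMU attributes more concrete, here is a minimal sketch of how a 1 MB section entry could be encoded at the hardware level, using the ARMv7-A short-descriptor format. It is not the HWLib API; HWLib builds these entries from its own attribute names. The sketch assumes TEX remap is disabled, domain 0 and full read/write access, and the names cache_policy and section_normal() are made up. With TEX[2] set to 1, the lower two TEX bits select the outer (L2) policy and the C and B bits select the inner (L1) policy.

```c
#include <stdint.h>

/* Cache policy encodings used by the TEX = 1BB, C:B = AA form */
enum cache_policy {
    NON_CACHEABLE = 0,
    WB_WA         = 1,  /* write-back, write-allocate */
    WT            = 2,  /* write-through, no write-allocate */
    WB_NO_WA      = 3   /* write-back, no write-allocate */
};

/* Hypothetical helper: builds a section entry for Normal memory */
static uint32_t section_normal(uint32_t base, enum cache_policy inner,
                               enum cache_policy outer, int shared)
{
    uint32_t tex = 0x4u | (uint32_t)outer;      /* TEX[14:12] = 1BB */
    uint32_t cb  = (uint32_t)inner;             /* C (bit 3), B (bit 2) */

    return (base & 0xFFF00000u)                 /* section base address */
         | (shared ? (1u << 16) : 0u)           /* S bit: shareable */
         | (tex << 12)
         | (3u << 10)                           /* AP[1:0] = 11: full access */
         | (cb << 2)
         | 0x2u;                                /* descriptor type: section */
}
```

For example, option 1 corresponds to something like section_normal(base, WB_WA, WB_WA, 1), while passing NON_CACHEABLE as the inner policy keeps the region out of the L1 cache, as in option 2.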

L2 data cache coherency

The L2 cache is a separate component (ARM CoreLink Level 2 Cache Controller L2C-310), which is outside of the cores (grey coloured boxes).

Similar to L1 cache, to keep data coherent between the two cores we can apply one of these options:

  1. On both cores, set up the MMU table with the memory regions (entries) concerned, assign attributes that enable the L2 cache with memory_type set to NORMAL and shared_type set to SHARED, then enable SMP coherency support (SMP participation and MMU cache broadcasting) and enable the Snoop Control Unit (SCU)
  2. On both cores, set up the MMU table with the memory regions (entries) concerned and assign attributes that disable the L2 cache: outer_cachability_type set to NON-CACHEABLE, or memory_type set to DEVICE or STRONGLY ORDERED
  3. Disable the L2 cache
  4. Clean the L2 cache every time we change the shared data
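
For the "clean the cache" option, cache maintenance works on whole cache lines (32 bytes on the Cortex-A9 and L2C-310), so the range covering the shared data normally needs widening to line boundaries first. The sketch below only does that arithmetic; the names clean_range and cache_clean_range() are made up. On the target, the resulting range would then be passed to a cache maintenance routine. I believe HWLib's alt_cache_system_clean() is the relevant call and that it expects line-aligned arguments, but treat that as an assumption and check alt_cache.h.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 32u   /* Cortex-A9 L1 and L2C-310 line length */

struct clean_range {
    uintptr_t start;   /* rounded down to a cache line boundary */
    size_t    length;  /* multiple of the cache line size */
};

/* Hypothetical helper: widens [addr, addr + len) to whole cache lines */
static struct clean_range cache_clean_range(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    struct clean_range r;
    uintptr_t end = (addr + len + CACHE_LINE_SIZE - 1u) & mask;

    r.start  = addr & mask;
    r.length = (size_t)(end - r.start);
    return r;
}
```

Cleaning only pushes the writer's changes out to memory. If the reading core has the same lines cached, it also needs to invalidate them before reading.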

L1 & L2 data cache coherency for external components (e.g. FPGA IPs, F2H bridge, etc)

The SCU, ACP (Accelerator Coherency Port) and ACP ID Mapper components maintain L1 & L2 cache data coherency for AXI master peripherals that are external to the processors, but this is out of scope here.

*Note, this coherency does not cover the instruction cache (L1 i-cache) because none of these components have direct access to it.

"Hello, World!" AMP example


I've made a bare-metal C, AMP version of the "Hello, World!" example, which demonstrates a basic AMP application. It consists of two separate "Hello, World!" programs, one for each of the cores. Download it from helloworld_amp.

C startup files for AMP system


I had to create my own linker and startup files.

Core 0 startup files in app1 source code:

File                                          Description
ldscript/tru_c5_ddr_core0.ld                  GNU linker script
startup/tru_config.h, startup/startup.c       Startup file (normal)
startup/tru_config.h, startup/startup_etu.c   Startup file (exiting to U-Boot)

Core 1 startup files in app2 source code:

File                                      Description
ldscript/tru_c5_ddr_core1.ld              GNU linker script
startup/tru_config.h, startup/startup.c   Startup file

Linker file settings


In an AMP system, each application should have its own memory space, which in my example is managed by the GNU linker files. You will find the user settings near the beginning of the linker files.

RAM size for each core

These settings configure the memory size for each core, i.e. each application. Set them to your requirements, but ensure __CORE1_RAM_SIZE is set to the same value in both linker files.

In app1 linker file ldscript/tru_c5_ddr_core0.ld we have:

__CORE1_RAM_SIZE = 64M;
__CORE0_RAM_SIZE = 64M;

In app2 linker file ldscript/tru_c5_ddr_core1.ld we have:

__CORE1_RAM_SIZE = 64M;

RAM starting address for each core

These settings do not need changing.

For core 0, the starting address is auto calculated:

__CORE0_RAM_BASE = __CORE1_RAM_BASE + __CORE1_RAM_SIZE;

For core 1, the starting address must start at 0x0, which is already preset:

__CORE1_RAM_BASE = 0x0;
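
To see what the linker settings work out to, the same arithmetic can be reproduced in a few lines of C. This is only a worked example using the 64M values shown earlier; the names amp_layout and make_amp_layout() are made up and are not part of the linker files.

```c
#include <stdint.h>

#define MIB(n) ((uint32_t)(n) * 1024u * 1024u)   /* "64M" in a linker script */

struct amp_layout {
    uint32_t core1_base, core1_size;   /* app2 */
    uint32_t core0_base, core0_size;   /* app1 */
};

/* Hypothetical helper mirroring the linker file calculation */
static struct amp_layout make_amp_layout(uint32_t core1_size, uint32_t core0_size)
{
    struct amp_layout l;

    l.core1_base = 0x0u;                          /* __CORE1_RAM_BASE */
    l.core1_size = core1_size;                    /* __CORE1_RAM_SIZE */
    l.core0_base = l.core1_base + core1_size;     /* __CORE0_RAM_BASE */
    l.core0_size = core0_size;                    /* __CORE0_RAM_SIZE */
    return l;
}
```

With both sizes at 64M, app2 occupies 0x00000000 to 0x03FFFFFF and app1 occupies 0x04000000 to 0x07FFFFFF. This also shows why __CORE1_RAM_SIZE must match in both linker files: app1's base address is derived from it.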

Stack size for each core

My startup code makes use of Intel's HWLib for setting up the MMU table, but their code uses a large local array, so we must give the user stack (which is also used as the system stack) more than 4096 bytes. Also note, depending on defines, HWLib may set up the stack pointers again - see the alt_interrupt.c file.

Stack settings used by my startup code in both linker files:

__SYS_STACK_SIZE = 8192;
__UND_STACK_SIZE = 4096;
__ABT_STACK_SIZE = 4096;
__SVC_STACK_SIZE = 4096;
__IRQ_STACK_SIZE = 4096;
__FIQ_STACK_SIZE = 4096;

Startup


In the linker file, the ENTRY command is set to the reset_handler() function, which makes it the first function that starts. This function is located inside the startup files.

Depending on the settings, the startup file initialises the following (note, some are conditional):

  • Setup stack pointers
  • Setup access permissions
  • Setup vector table
  • Setup NEON SIMD extension
  • Setup MMU (Memory Management Unit)
  • Setup L1 caches (level 1 instruction and data caches)
  • Setup SMP coherency support
  • Setup L2 cache
  • Setup SCU (Snoop Control Unit)
  • Call newlib _startup() function (or call its alias _mainCRTStartup())

Note, newlib's _startup() clears the .bss section (zero-initialised data) and also sets up the stack pointers for us. I've overridden newlib's weak _stack_init() so that it only sets up the system and user stack pointer.
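
For the "SMP coherency support" and "SCU" steps in the list above, the bits involved on a Cortex-A9 are the SMP bit (bit 6) and FW bit (bit 0) of the Auxiliary Control Register (ACTLR), and the enable bit (bit 0) of the SCU Control Register. The sketch below only computes the new register values so that it can run anywhere. It is not taken from my startup files, and the function names are made up. On the target, ACTLR is accessed through CP15 (c1, c0, 1) and the SCU registers sit at the start of the MPU private peripheral region (I believe 0xFFFEC000 on the Cyclone V, but verify this in the TRM).

```c
#include <stdint.h>

#define ACTLR_FW         (1u << 0)  /* broadcast cache and TLB maintenance */
#define ACTLR_SMP        (1u << 6)  /* core takes part in coherency */
#define SCU_CTRL_ENABLE  (1u << 0)  /* SCU Control Register enable bit */

/* Hypothetical helper: ACTLR value with SMP participation and broadcasting on */
static uint32_t actlr_enable_smp(uint32_t actlr)
{
    return actlr | ACTLR_SMP | ACTLR_FW;
}

/* Hypothetical helper: SCU Control Register value with the SCU enabled */
static uint32_t scu_ctrl_enable(uint32_t scu_ctrl)
{
    return scu_ctrl | SCU_CTRL_ENABLE;
}

/* On the target (GCC inline assembly), reading and writing ACTLR looks like:
 *     __asm__ volatile ("mrc p15, 0, %0, c1, c0, 1" : "=r" (actlr));
 *     __asm__ volatile ("mcr p15, 0, %0, c1, c0, 1" : : "r" (actlr));
 */
```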

Debugging with Eclipse GDB/OpenOCD


It seems the GDB & OpenOCD plugin for "Eclipse IDE for Embedded C/C++" does not support AMP debugging, i.e. we cannot debug both cores in a single debug session; we can only debug one core at a time. If we debug core 0, breakpoints on the other core (core 1) are ignored and never hit. The same applies the other way round.

I have created some Eclipse debug launch files.

File                       Description
hwamp_app1x_debug.launch   Debugging on core 0 only. Executes both apps.
hwamp_app1_debug.launch    Debugging on core 0 only. Executes only app1 on core 0.
hwamp_app2x_debug.launch   Debugging on core 1 only. Executes both apps.
hwamp_app2_debug.launch    Debugging on core 1 only. Executes only app2 on core 1.

In the Eclipse project, app1 is set to depend on app2, but there seems to be an Eclipse bug: in certain cases Eclipse will not build or rebuild app2! If you get a load or missing file error when launching a debug session, go into the app2 project and rebuild it first.

Note, for the app2 debug launch to run on core 1, a software breakpoint BKPT #0 (machine code: 0xe1200070) is used. This is preset in the OpenOCD parameters within the debug launch settings.
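
If you are wondering where the 0xe1200070 value comes from, it follows from the ARM (A32) encoding of BKPT: the 16-bit immediate is split into an upper 12-bit field at bits 19 to 8 and a lower 4-bit field at bits 3 to 0. A small sketch, with the function name arm_bkpt() made up for the example:

```c
#include <stdint.h>

/* Hypothetical helper: returns the A32 machine code for BKPT #imm16 */
static uint32_t arm_bkpt(uint16_t imm16)
{
    uint32_t imm12 = (uint32_t)imm16 >> 4;     /* upper 12 bits -> bits 19:8 */
    uint32_t imm4  = (uint32_t)imm16 & 0xFu;   /* lower 4 bits  -> bits 3:0 */

    return 0xE1200070u | (imm12 << 8) | imm4;
}
```

Calling arm_bkpt(0) gives 0xE1200070, matching the value used in the launch settings.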

Build with makefile script


Use my shortcut to start WSL and enter this command to build the SD card image:

make sd=1

Note, it builds both projects, so you should see two bin files in the console. Write the SD card image to a micro SD card, boot it on the DE10-Nano, and view the messages in a serial terminal program:

Hello, World! AMP serial terminal output (PuTTY)

Document date: Rev 5: 07 Apr 2024 - Moved data cache coherency to a separate heading
Document date: Rev 4: 16 Mar 2024 - Corrected AXI master data coherency to L1 & L2
Document date: Rev 3: 29 Jan 2024 - Renamed linker symbols
Document date: Rev 2: 23 Jan 2024 - Correction on the L2 data coherency
Document date: Rev 1: 20 Jan 2024