Forth for Cortex-M4 Part I: Blinkenlights

So, I decided to learn ARM assembler and the details of running bare metal on an ARM chip. In my experience, it is easier to learn something when you have some form of higher goal to strive for, and thus I set the goal of writing a Forth in ARM assembler. My thinking is that the core of a Forth should be simple enough to write in pure assembler, and then the Forth can be used to further experiment in an interactive environment.

I thought that the Cortex-M architecture would be a nice start, as this is hugely popular in embedded and IoT devices. I already had a few Adafruit Feather boards around, like an ATSAMD21G18 based Feather M0 Basic Proto (I also have a few of those with ISM and LoRa radios on them, which I can experiment further with once I have the basic Forth up and running), as well as an ATSAMD51J19 based Metro M4 Express AirLift (which also includes an ESP32 with support for WiFi and Bluetooth).

I started out reading about the Cortex-M0, but then decided to switch to the Cortex-M4, mostly because my Metro M4 card comes with an SWD debugging connector. The Feather M0 also has SWD connectors, but only in the form of solder pads, so you would need to solder additional wires. The basics of the M0 and M4 are close enough so that I can easily adapt my Forth to run on both chips, once debugged.

The ATSAMD51 on the Metro M4 also has a lot of GPIO, SPI, I2C, crypto, etc., which will be interesting to play with, but which are not important for the basic Forth implementation.

The Cortex-M4 implements the ARMv7E-M architecture and supports the Thumb/Thumb-2 ISA (Instruction Set Architecture).

Doing bare metal programming is interesting, as you have no support of any operating system. This means you will need to understand the chip you are using on a fairly detailed level. The first question is “how does a program start running on the chip with no operating system or other support available”? In order to understand this, let us have a look at the bottom part of the memory layout of our Cortex-M4.

That is, the flash resides at the bottom of the address space, starting at 0x00000000 and occupying 512kB, and the RAM starts at 0x20000000 and occupies 192kB.

Further to this, the MCU expects a vector table starting at address 0x00000000. The first entry of the table is the initial address of the runtime stack, and it is followed by a number of interrupt vectors. Arguably, the most important of these interrupt vectors is the Reset_Handler, which is the address where execution will start after a reset.
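To make the layout of the first two table entries concrete, here is a small Python sketch that packs them into bytes. The example addresses are assumptions for illustration only (stack at the end of 192kB of RAM, handler right after a two-word table), and the helper name build_vector_table is made up for this sketch.

```python
import struct

# Cortex-M cores execute only Thumb code, so code addresses stored in the
# vector table carry a 1 in bit 0.
THUMB_BIT = 1


def build_vector_table(stack_top: int, reset_handler: int) -> bytes:
    """Pack the first two vector table entries as little-endian 32-bit words."""
    return struct.pack("<II", stack_top, reset_handler | THUMB_BIT)


# Assumed example values, not taken from a real build:
# the stack starts at the end of 192 kB of RAM, and the reset handler
# sits directly after the two-word table at address 0x00000008.
table = build_vector_table(0x20000000 + 192 * 1024, 0x00000008)
print(table.hex())
```

When the table is written in assembler, you do not set bit 0 by hand; marking the handler with .thumb_func lets the toolchain take care of it.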

However, the Metro M4 card comes preloaded with a bootloader called UF2, as well as CircuitPython. The older Feather M0 boards, such as the Basic Proto, come with a bootloader called SAM-BA. The primary function of the bootloader is to make it easier to load new firmware onto the chip, allowing you to upload using USB rather than having to upload through SWD. I thus opted for keeping the bootloader, at least during development.

However, keeping the bootloader means that it will be the bootloader that resides at address 0x00000000 and gets started when the chip is reset. It turns out, though, that the only thing you need to do with your own program is to place it a little bit higher up in memory; the basic layout, with the initial IRQ vector table followed by your code, will be exactly the same. In the case of the UF2 bootloader, the first 16kB of flash will be reserved, and in the case of the SAM-BA bootloader, the first 8kB will be reserved.

This means that for the Metro M4, you will place your program starting at address 0x00004000 instead.

In order to write our first program, Blinkenlights, we also need to figure out how to control the LED on the Metro M4; the board actually has two LEDs, one simple, red LED, and one fancy, multicolored one. We will use the simple, red LED, as that is easiest to control.

The red LED on the Metro M4 is connected to I/O pin #16 of PORTA (PA16). All I/O pins on the chip are controlled through a PORT, which is a group of registers where you can set the function of the I/O pins. In our case, we only need to set the direction to OUTPUT, and then toggle the pin on and off at a reasonable rate to make the LED blink.

The ATSAMD51J19 has two PORTs, PORTA and PORTB. For controlling pin #16, we will need PORTA, which resides at address 0x41008000. The different registers of the PORT are located at different offsets from this base address. Looking at the program listing below, the two registers that we are interested in are DIRSET, which sets the direction (a set bit for a particular pin means OUTPUT), and OUTTGL, which toggles the output of the pin. Each I/O pin is represented by one bit in most of these registers, and thus by setting bit #16 in DIRSET and then toggling bit #16 in OUTTGL, we should be able to get the LED to blink.
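As a quick sanity check of the address arithmetic, the following Python sketch computes the absolute register addresses and the bit mask for pin #16. The base address, port spacing and offsets are the ones used in the listing; the function names are made up for this sketch.

```python
PORT_BASE = 0x41008000   # base address of the PORT registers
PORT_SPACING = 0x80      # each port occupies 0x80 bytes
OFFSETS = {"DIRSET": 0x08, "OUTTGL": 0x1C}


def port_register(port_index: int, name: str) -> int:
    """Absolute address of a register in PORTA (index 0) or PORTB (index 1)."""
    return PORT_BASE + PORT_SPACING * port_index + OFFSETS[name]


def pin_mask(pin: int) -> int:
    """Bit mask with only the bit for the given pin set."""
    return 1 << pin


print(hex(port_register(0, "DIRSET")))  # PORTA DIRSET
print(hex(port_register(0, "OUTTGL")))  # PORTA OUTTGL
print(hex(pin_mask(16)))                # mask for pin #16
```

The assembler does the same arithmetic at build time through the .equ definitions, so the program only needs to load the PORTA base address once and use the register offsets from there.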

	.syntax	unified

	.text
	.align	2

__vectors:
	.long	__stack
	.long	Reset_Handler
	.size	__vectors, . - __vectors

	.equ	PORT, 0x41008000
	.equ	PORTA, PORT + 0x80 * 0
	.equ	PORTB, PORT + 0x80 * 1
	.equ	DIR, 0x00
	.equ	DIRCLR, 0x04
	.equ	DIRSET, 0x08
	.equ	DIRTGL, 0x0c
	.equ	OUT, 0x10
	.equ	OUTCLR, 0x14
	.equ	OUTSET, 0x18
	.equ	OUTTGL, 0x1c
	.equ	IN, 0x20
	.equ	CTRL, 0x24
	.equ	WRCONFIG, 0x28
	.equ	EVCTRL, 0x2c

	.globl	Reset_Handler
	.thumb_func
Reset_Handler:
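	@@@	Set PA16 (red LED) to OUTPUT
	LDR	R0, =PORTA		@ R0 = base address of PORTA
	MOVS	R2, #1
	LSLS	R2, #16			@ R2 = 1 << 16, the bit mask for PA16
	STR	R2, [R0, #DIRSET]	@ a 1 bit written to DIRSET makes the pin an output

toggle:
	STR	R2, [R0, #OUTTGL]	@ a 1 bit written to OUTTGL flips the pin level

	@@@	Busy-wait for 2^20 loop iterations
	MOVS	R3, #1
	LSLS	R3, #20			@ R3 = 1 << 20

delay:
	SUBS	R3, R3, #1		@ count down and update the flags
	BNE	delay			@ keep looping until R3 reaches zero

	B	toggle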

.end

We start by telling the assembler that we want to use the unified syntax; then we start the .text segment (i.e. the code). We begin with the vector table, where we put the address of the stack and the address of our entry function. After that come a number of definitions of addresses, to make it easier to work with the PORTs; you can see that each port occupies 0x80 bytes (there are more registers than the ones listed, but as we will not use them now, I have left them out).

After the vector table and definitions comes the entry point, Reset_Handler. We start by setting bit #16 in the DIRSET register. After that, we move into a loop which will toggle bit #16 in the OUTTGL register, do a busy-spin for some time and then loop back and toggle again.
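The behaviour of the loop can be mimicked with a toy model in Python. This is only a sketch under stated assumptions: the class PortModel is made up, it models nothing but the DIR and OUT bits, and it assumes both start out as zero. It says nothing about timing, since the real blink rate depends on the clock speed of the chip.

```python
class PortModel:
    """Toy model of the two PORT registers used by the listing."""

    def __init__(self) -> None:
        self.dir = 0  # a 1 bit means the pin is an output
        self.out = 0  # a 1 bit means the pin is driven high

    def write_dirset(self, value: int) -> None:
        self.dir |= value   # set the written bits, leave the others alone

    def write_outtgl(self, value: int) -> None:
        self.out ^= value   # flip the written bits, leave the others alone


def busy_wait(count: int) -> int:
    """Mirror the SUBS/BNE loop and return the number of iterations."""
    iterations = 0
    while True:
        count -= 1          # SUBS R3, R3, #1
        iterations += 1
        if count == 0:      # BNE falls through once the result is zero
            break
    return iterations


def run(port: PortModel, toggles: int, delay_count: int = 1 << 20) -> list:
    """Run the Reset_Handler logic for a fixed number of toggles."""
    mask = 1 << 16
    port.write_dirset(mask)
    levels = []
    for _ in range(toggles):
        port.write_outtgl(mask)
        levels.append(bool(port.out & mask))
        busy_wait(delay_count)
    return levels


port = PortModel()
levels = run(port, toggles=4, delay_count=1000)  # short delay for the demo
print(levels)
```

The point of the SET and TGL style registers is visible in the model: the program never has to read the current register value, it just writes the mask.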

In order to produce an object file from the above assembler file, we will use the GNU assembler (as) as follows:

Now we have an object file, but it cannot yet be loaded onto the board. We also need to invoke the linker, to resolve addresses and make sure our program is placed correctly in memory. In order to do this, we will use the following linker script (in the file samd51.ld):

Here we first declare the basic memory layout, starting with a 16kB BOOT, followed by the part for the FLASH that is available to us (i.e. the 512kB minus the 16kB reserved for the bootloader), followed by the SRAM.

We then declare the reset handler as the main entry point, and follow that with the details of the different sections that will go into flash and RAM. We are actually only using the .text section, so .data and .bss are there for future use. Finally, we declare the stack to reside at the end of the RAM; it is customary to place the stack high in memory and have it grow downwards, and we will later place the heap at the beginning of RAM and have it grow upwards.
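The numbers behind this memory layout can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the linker script itself; the region names follow the description above, while the Region type and STACK_TOP are names made up for the sketch.

```python
from typing import NamedTuple

KB = 1024


class Region(NamedTuple):
    name: str
    origin: int
    length: int

    @property
    def end(self) -> int:
        """First address past the end of the region."""
        return self.origin + self.length


BOOT = Region("BOOT", 0x00000000, 16 * KB)               # reserved for the UF2 bootloader
FLASH = Region("FLASH", BOOT.end, 512 * KB - BOOT.length)  # what is left for our program
SRAM = Region("SRAM", 0x20000000, 192 * KB)

# The stack starts at the end of RAM and grows downwards.
STACK_TOP = SRAM.end

for region in (BOOT, FLASH, SRAM):
    print(f"{region.name:5} origin=0x{region.origin:08X} length={region.length // KB}kB")
print(f"stack top = 0x{STACK_TOP:08X}")
```

With these values the program starts at 0x00004000, has 496kB of flash available, and the initial stack pointer is 0x20030000.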

In order not to have to type the commands for assembling and linking by hand all the time, we will also use the following Makefile to automate the task:

We can now use just make to assemble and link the program, make dump to get a dump of the linked program (useful for verifying that everything gets placed at the correct addresses), and make flash to flash the program to our device.
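For reference, the assemble, link and dump steps boil down to a handful of tool invocations. The Python sketch below spells out one plausible set of them. It is not the Makefile itself, and it rests on assumptions: a toolchain with the arm-none-eabi- prefix, a source file called blink.s (a hypothetical name), and a raw binary as the final output. The flash step is left out, as it depends on the bootloader.

```python
import subprocess

PREFIX = "arm-none-eabi-"   # assumed toolchain prefix
NAME = "blink"              # hypothetical base name of the source file
LDSCRIPT = "samd51.ld"      # linker script file named in the text


def build_commands(name: str = NAME) -> dict:
    """Return the command line for each build step as an argument list."""
    return {
        "assemble": [PREFIX + "as", "-mcpu=cortex-m4", "-mthumb",
                     "-o", f"{name}.o", f"{name}.s"],
        "link": [PREFIX + "ld", "-T", LDSCRIPT,
                 "-o", f"{name}.elf", f"{name}.o"],
        "dump": [PREFIX + "objdump", "-d", f"{name}.elf"],
        "binary": [PREFIX + "objcopy", "-O", "binary",
                   f"{name}.elf", f"{name}.bin"],
    }


def run_steps(steps: list) -> None:
    """Run the given steps in order, stopping at the first failure."""
    commands = build_commands()
    for step in steps:
        subprocess.run(commands[step], check=True)


if __name__ == "__main__":
    # Only print the commands; call run_steps(["assemble", "link"]) to run them.
    for step, argv in build_commands().items():
        print(f"{step}: {' '.join(argv)}")
```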

In order to flash, though, you will need either the BOSSA tool (if you are running the SAM-BA bootloader), or uf2conv if you are running the UF2 bootloader.
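To give an idea of what the UF2 conversion involves, here is a minimal Python sketch that wraps a raw binary into 512-byte UF2 blocks. It is an illustration and not a replacement for uf2conv. The magic numbers, the family ID flag and the SAMD51 family ID are written down from my understanding of the UF2 format and should be checked against the UF2 specification; the function name bin_to_uf2 is made up.

```python
import struct

UF2_MAGIC_START0 = 0x0A324655
UF2_MAGIC_START1 = 0x9E5D5157
UF2_MAGIC_END = 0x0AB16F30
UF2_FLAG_FAMILY_ID = 0x00002000   # the eighth header word holds a family ID
SAMD51_FAMILY_ID = 0x55114460     # assumed value, verify against the UF2 spec
PAYLOAD_SIZE = 256                # bytes of program data per block
DATA_FIELD_SIZE = 476             # size of the data field in every block


def bin_to_uf2(image: bytes, base_addr: int = 0x00004000,
               family_id: int = SAMD51_FAMILY_ID) -> bytes:
    """Wrap a raw binary image into a sequence of 512-byte UF2 blocks."""
    chunks = [image[i:i + PAYLOAD_SIZE]
              for i in range(0, len(image), PAYLOAD_SIZE)]
    out = bytearray()
    for block_no, chunk in enumerate(chunks):
        header = struct.pack(
            "<8I",
            UF2_MAGIC_START0,
            UF2_MAGIC_START1,
            UF2_FLAG_FAMILY_ID,
            base_addr + block_no * PAYLOAD_SIZE,  # where this chunk goes in flash
            PAYLOAD_SIZE,
            block_no,
            len(chunks),
            family_id,
        )
        # Pad the data field with zeros up to its fixed size.
        data = chunk.ljust(DATA_FIELD_SIZE, b"\x00")
        out += header + data + struct.pack("<I", UF2_MAGIC_END)
    return bytes(out)


uf2 = bin_to_uf2(bytes(300))   # dummy 300-byte image for the demo
print(len(uf2))
```

The default base address of 0x00004000 matches the start of our program after the 16kB reserved for the UF2 bootloader.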

To read up on the details of how to flash the Metro M4, as well as some other good-to-know information, I recommend you browse through the Adafruit documentation on the subject.

Once the program has been flashed to the Metro M4, and the board has been reset, you should see a blinking LED!

We now have a basic skeleton of an assembler program for the Cortex-M4 on the Metro M4 board. We can assemble and link the program into a binary, and we can flash the binary onto the board.

This concludes part 1 of our series. In the next part, we will implement the Forth inner loop. Until then, happy hacking!