On hardware state machines: How to write a simple MAC controller using the RP2040 PIOs

Hardware state machines are nice, and the RP2040 has two blocks with up to four machines each. Their instruction set is limited, but powerful, and they can execute an instruction per cycle, pushing and popping from their FIFOs and shifting bytes in and out.

The Raspberry Pi Pico does not have an Ethernet connection, but there are many PHY boards available… take a LAN8720 board and connect it to the Pico; you’re done. The firmware ? Introducing Mongoose…

Mongoose is a networking library designed to make it easy to handle many networking tasks, and across multiple platforms. It can run baremetal on a microcontroller like the RP2040, or with an RTOS (FreeRTOS, TI-RTOS), with or without a third-party TCP/IP stack (lwIP, FreeRTOS TCP); or it can run on an OS (Linux, Windows, MacOS) . It is written by and belongs to Cesanta Software; and it is also the core of Mongoose-OS.

The following is a discussion of the inner workings of a Mongoose code example. This code is open source and belongs to Cesanta, it can be found in this Github repository, from where the code snippets are taken. A higher level tutorial on the example, how to use it, and how to write a driver for Mongoose, is available at the Mongoose site. For details on connecting the hardware and using it, go to either site, or both…

Now, let’s start… As an Electronics Engineer, I’d rather do a bottom-up description, so I’ll start from where the title of this post suggests: the PIO state machines.

The RMII low-level driver

As the Ethernet physical interface is a twisted-pair cable, this reduces the burden: the software MAC controller does not need a CSMA/CD section implemented; even though negotiation is allowed, the driver works in full-duplex mode and so its task is to send and receive frames only.

Due to timing constraints and to keep hardware simple (using out-of-the-box boards) the RP2040 takes its clock from the LAN8720 chip and only the 10BASE-T mode is supported.

The RMII interface (a Reduced version of the Media Independent Interface) is synchronous, all data exchanged must be valid on the rising edge of a 50MHz master clock. This clock is fed to the RP2040 and both the CPU and the PIO state machines will run synchronous to this clock.

The interface consists of a transmit section and a receive section, in both of them data is exchanged in “di-bits” (a “2-bit nibble” or 2-bit bus), synchronous to the main clock but at 5MHz (2 bits x 5MHz = 10Mbps), with an additional signal each that indicates when data is valid. Frames are exchanged in their whole length, from the first preamble bit to the last CRC bit, LSb first.

The Ethernet preamble consists of 64 bits, 62 of them are a 5MHz square wave, that is, 1010…. changing every 100ns. The last two bits break that with two consecutive ones: 11, the sync pattern. Then the usual MAC frame follows, ending with the 32-bit FCS (Frame Check Sequence), that is the CRC (Cyclic Redundancy Check) of the whole MAC frame using a specified 32-bit polynomial. Finally, a 12-byte (9.6 μs) silence period must follow, the IFG (Inter-Frame Gap). We must honor all this when sending, and sync to that when receiving.

The RMII interface also includes a management interface, the SMI, consisting of an MDIO bidirectional data line and an MDC clock that according to the specifications must not exceed 2.5MHz. Data is synchronous to this clock with some setup and hold constraints we’ll see later.

Sending a frame

To send a frame, the state machine needs to generate the preamble, then pop data out of the PIO FIFO and shift it to placetwo bits at a time, LSb first, on the pins connected to the PHY. Once a byte has been sent, it will pop another one from the FIFO until there are no more bytes; in which case it will stop and wait for more data. In other words: wait for the FIFO to have data, send the preamble, send the data until the FIFO gets full, and loop to wait again. The data will be pushed to the FIFO with a DMA controller, and the sending process plus the preamble delay is slower than the controller filling the FIFO, so this works.

Here is the PIO assembler code to do that; as the sending loop needs one additional cycle to check for more data in the FIFO, the state machine works at 10MHz, twice the data rate, so most instructions insert a 1-cycle delay to be synchronous to an imaginary 5Mhz clock in phase with the PHY 50MHz clock:

// pins 0, 1 map to TX0, TX1, respectively, and can be set or out. Pin 2 maps to TX_EN and is side_set controlled
// Runs at 10MHz, data is output in di-bits at 5MHz for 10Mbps
.program rmii_tx
.side_set 1
.wrap_target
    set pins, 0b00      side 0          // Disable Tx
    pull                side 0          // Stall waiting for data to transmit
    set X, 29           side 0          // load counter for 30 repetitions
    set pins, 0b01      side 1  [1]     // Enable Tx and Write 0b01, start sending preamble
hdrloop:
    jmp X--, hdrloop    side 1  [1]     // "repeat" pattern 30 times, total 31 = 62-bit preamble
    set pins, 0b11      side 1  [1]     // Write 0b11, 2-bit sync; +2 = 64
dataloop:
    out pins, 2         side 1          // Write data in dibits
    jmp !OSRE dataloop  side 1          // 2, wrap when no more data in FIFO
.wrap

To send the preamble, the code presents the bits 01 (remember this is LSb first so that will be 10 in the wire) to the PHY with the TX_EN line enabled; this line is controlled with the side-set facility. The code then loops for thirty clock periods (jmp X-- also executes when X is 0), to a total of 31 times, which corresponds to 31 di-bits, 62 bits (6.2 μs). Then it sends the sync pattern, 2 bits (200 ns), total 64 bits (6.4 μs).

Logic analyzer capture showing the preamble signal generated; see the timing measurements

Finally, the code sends all data until the FIFO is empty. At that point it wraps around and stalls waiting for the FIFO to be written, condition that will be satisfied when the next frame is presented to be sent.

Logic analyzer capture of the generated signals at the RMII Tx interface, in relationship to the reference 50MHz clock. Sample rate is 500Mhz
Real-life capture of the TX_EN signal in relationship to the reference 50MHz clock. The 100MHz oscilloscope can’t properly cope with such signals; the flat cable wiring might be introducing some impedance mismatch and reflections causing that peak

State machine initialization code follows:

All pins are set as outputs, the TX pins to be controlled by the out and set instructions, and the TX_EN pin by the side-set operation. Since this machine only outputs data, both FIFOs are joined into the Tx FIFO. The machine shifts bytes LSb first.

Receiving a frame

When the PHY detects a valid carrier signal, it raises the CRS_DV line. The PIO state machine waits for this line to be high, then it tries to sync by waiting for a 01 condition on the pins, and then for a 11 condition. Once satisfied, a loop will read two bits at a time from the pins and every time a byte is filled it will be pushed to the PIO FIFO, where a DMA controller will move it to a ping-pong buffer.

// see LAN8720A datasheet: 3.4.1.1, 3.1.4.3
// pins 0, 1, 2 map to RX0, RX1, CRS_DV, respectively, and are inputs
.program rmii_rx
.wrap_target
    wait 0 pin 2  [1]   // wait for any outstanding frame to end (CRS_DV=0)
    wait 1 pin 2  [1]   // wait for a frame to start (CRS_DV=1), data may be b00 until proper decoding
    wait 1 pin 0  [1]   // sync to preamble (b01), uses 2 cycles of the 31-cycle (62-bit) preamble
    wait 0 pin 1  [1]
    wait 1 pin 1  [1]   // wait for start-of-frame indicator (b11)
dataloop:
    in pins, 2
    jmp pin dataloop    // 2, stop collecting when frame ends (CRS_DV=0)
    in pins, 2          // but account for 2.5MHz toggling at the end
    jmp pin dataloop    // 2
    mov ISR, NULL
    IRQ wait 0          // signal DMA to switch buffers and process this frame, wait to restart
.wrap

The PHY drops its CRS_DV line when it detects the end of the frame; but as it has an internal FIFO, some data may still be waiting in it. Under this condition, the CRS_DV line is toggled on nibble boundaries until the FIFO contents are sent out, so when this line goes down the code will check if it is raised again. If it was, the code will keep looping until a steady low CRS_DV line is detected. If not, the possible extra data in the IRS is cleared and discarded to be fresh for the next frame.

Detected the end of a frame, the code triggers an interrupt so the firmware can switch the DMA buffers and process the frame. Once the firmware has switched to the other buffer in the ping-pong set, it will clear the interrupt and the code will wrap and be ready to receive the next frame, so this should be done in a time shorter than the IFG.

State machine initialization code follows:

All pins are set as inputs, then the CSR_DV pin is assigned to be read by the jmp instruction. Since this machine only inputs data, both FIFOs are joined into the Rx FIFO. The machine shifts bytes LSb first.

The management interface (SMI)

In the SMI (Simple Management Interface), data is exchanged MSb first in 16-bit quantities over a bidirectional MDIO (Management Data I/O) line, synchronous to the MDC (Management Data Clock) line driven by the MAC controller.

A frame consists of a 32-ones preamble, a 16-bit control word, and a 16-bit data word, driven either by the MAC controller or the PHY depending on the control word. The whole spec is available on the Internet, so you can check it out.

The MAC controller drives the MDIO line during the preamble and most of the control word time, and in the data time when writing. When the MAC controller issues a read command, it releases the MDIO line and lets the PHY take over during the turnaround time.

Data must be stable at the rising edge of the clock, MAC controller data has a 10 ns setup and hold times constraint. The PHY must have a maximum access time of 300 ns, so with the fastest clock period of 400 ns (2.5 MHz) the usual (and safe) practice is to read the MDIO line during the second half of the low semi-cycle of the MDC signal.

Writing a register to the PHY is simple:

// pin 0 maps to MDIO, and can be set or out. Pin 1 maps to MDC (clock) and is side_set controlled
// Runs at <10MHz, data is sent at <2.5MHz
.program smi_wr
.side_set 1

start:
    set X, 31           side 0
preloop:
    set pins, 1         side 0  [1]
    jmp X--, preloop    side 1  [1]     // send preamble (4 clock cycles per bit)
hdloop:
    out pins, 1         side 0  [1]
    jmp , hdloop        side 1  [1]     // send header and data until stall

State machine initialization code follows:

Both pins are set as outputs, MDIO can be controlled by the out and set instructions, while MDC is controlled by the side-set operation. The machine shifts 16-bits MSb first.

Logic analyzer capture of the generated signals at the SMI interface when writing a register in the PHY

Reading a register from the PHY is a bit more involved, to be able to divide a 2.5MHz clock in 4 parts, we should run at 10Mhz, which is 50 MHz (the system clock driven by the PHY) divided by 5. To guarantee not to exceed that frequency it is convenient to use a divide by 6 constraint, but calculations can be safely done at 10MHz as working slowly will be safer.

// pin 0 maps to MDIO, and can be in, set or out. Pin 1 maps to MDC (clock) and is side_set controlled
// Runs at <10MHz, data is read at <2.5MHz on the second half of the low clock period
.program smi_rd
.side_set 1

start:
    set X, 31           side 0
preloop:
    set pins, 1         side 0  [1]
    jmp X--, preloop    side 1  [1]     // send preamble (4 clock cycles per bit)
    set X, 13           side 1          // ignore 2 LSb in header (turnaround)
hdrloop:
    out pins, 1         side 0  [1]
    jmp X--, hdrloop    side 1  [1]     // send header
    set pindirs, 0      side 0  [1]     // set to input, turnaround: bit 1
    set pins, 1         side 1  [1]     // 
    nop                 side 0  [1]     // bit 2
    set X, 15           side 1  [1]     // read 16 bit data
rdloop:
    nop                 side 0
    in pins, 1          side 0          // read on second half for
    jmp X--, rdloop     side 1  [1]     // 300ns max access time
.wrap_target
    set pindirs, 1      side 1          // drive bus high again, "wait"
.wrap

Perhaps the tricky part is to change the pin direction in the middle of the transfer, see how the code writes to pindir to set the pin as an input and then at the end sets it back as an output. Following is the state machine initialization that makes this possible, see how both write and read refer to the same pin:

The MDC pin is set to be controlled by the side-set operation as before, but the MDIO pin is not only configured to be controlled by the out and set instructions but also to be read by the in instruction. The machine shifts 16-bits MSb first.

Logic analyzer capture of the signals at the SMI interface when reading a register in the PHY

The Mongoose driver

Mongoose’s TCP/IP stack driver API is simple and elegant. As can be seen on this tutorial, it just requires writing a set of four functions (send, receive, interface is up, initialize) and there is a lock-free queue that can be used to decouple asynchronous environments like interrupt contexts.

Interface with the PHY

This is very simple and basic, as all the work is done by the PIO state machines. The functions prepare a 16-bit half-word containing the command to be written to the PIO FIFO and initialize the corresponding state machine.

The write function then writes command and data to the FIFO, while the read function writes the command and waits for the data to be pushed back by the state machine:

Sending a frame

To send a frame, the driver checks any former DMA operation has finished, by blocking on the DMA channel. Then, the data is copied to the send buffer (just in case the frame length is less than the Ethernet minimum, the buffer’s first 60 bytes are previously filled with zero and the length is trimmed to be 60 or more bytes). After this, the CRC is calculated and appended to the frame. Finally, a minimal wait enforces that the IFG is not violated and then the buffer is handed to the DMA channel to be sent.

Receiving a frame

As said above, the PIO state machine synchronizes to the preamble and detects the end of the frame, generating an interrupt.

The corresponding interrupt handler first reads the DMA channel to get the number of bytes left to be transferred; as the transfer was initialized to the maximum possible frame length, subtracting from that number gives the received frame length.

Then, the DMA channel is stopped, and reinitialized with the complementary buffer in the ping-pong set; then the PIO state machine gets its interrupt request acknowledged so it can wrap and continue receiving frames.

The received frame is then pushed to the lock-free queue as soon as possible as this is IRQ handler context.

Later, when the Mongoose event manager calls its built-in TCP/IP stack, it in turn will call this driver receive function.

This function checks the lock-free queue. If there is an outstanding frame, it is retrieved and its CRC checked; if it is correct, then the frame is delivered to Mongoose to be processed

Checking the link state

In this function, the driver asks the PHY for its status, reading the BSR (Basic Status Register), which contains the link state information that will be returned to Mongoose:

Initialization

The driver initialization code checks if the configuration specifies a queue size, otherwise it configures a default size of 8192 bytes.

Then the PIO state machine programs are loaded into the PIO memory areas. The RMII data machines and the SMI transmit machine take a big portion of the PIO0 memory; unfortunately there is not enough room left for the SMI receive machine, so this one uses PIO1:

Then, DMA is configured for the RMII data machines, and these are initialized.

The RMII data transmit DMA channel is configured to read from memory with auto increment and always write to the same address (the transmit machine FIFO). Transfer size is configured in bytes, and the machine is initialized and started, so it will stay waiting for data to be available in its FIFO to start sending a frame:

The PHY is initialized next, auto-negotiation is enabled but only for 10Mbps, this allows working with switches and (good old) hubs. A sleep call is inserted between calls so the PIO state machine is able to stop after the data byte (observing its FIFO empty) and wrap around its program:

Finally, the RMII receive DMA channel is initialized to read from a fixed address (the FIFO) and write with autoincrement to memory. Transfer size is also configured in bytes, and the channel is started right away with a maximum frame size transfer length. Then the PIO IRQ is configured and enabled, and then the PIO state machine is enabled; it will start the DMA data transfer when it synchronizes and starts pushing bytes to its receive FIFO:

Can this be improved ?

It always can…While writing this post, some possible improvements came to mind, please excuse me for brainstorming out loud. I wonder if we’ve already hit the sweet spot in the law of diminishing returns, but exercising our minds is always good.

The receiver could certainly be improved.

We could have two machines running the same program (no extra PIO memory usage), synced by an IRQ flag, let’s call it ‘x‘:

  wait irq x        // wait for the start order
  ...
  irq x nowait      // signal the other state machine to take over
  irq 0 wait        // signal the cpu there is an outstanding frame

Both machines would run the same program, but they would be configured with two different buffers. When the first machine finishes receiving a frame, it raises the internal IRQ flag and the second one takes over. The former then signals the system and waits for the CPU to acknowledge. Once the CPU has read the buffer, it will wrap and wait for the second machine to finish receiving to take over again. Let’s devise later how to start only one of the machines when the system starts up… probably starting at other address…

This would eliminate the 12μs constraint on switching DMA buffers, but a frame still must be dealt with before the other buffer fills up…

The next step would probably be to use more state machines, with more IRQ flags (which means different code but probably a jmp will do); for example, having four buffers will give us more time to handle a frame. Or, we could just have these two machines and a ping-pong buffer on each one, that would give us a total of four buffers; having to fill three of them would probably give us enough time to deal with a frame, and the two-machine strategy has already eliminated the buffer switching constraint.