14 - MR18 Deep Dive: Bare-Metal C
Writing a PHY driver without libc
I don’t need a feature, I just need a fix / raw register writes and a couple syscall tricks
Post 6 mentioned a standalone C program that pokes two MDIO registers to fix the AR8035 PHY’s broken RGMII RX clock delay. This deep dive rips open the actual source—ar8035_start.S, ar8035.c, and the Makefile—and walks through every layer from the assembly entry point down to the register writes that make Ethernet work. If you’ve ever wondered what bare-metal Linux userspace programming looks like on MIPS with zero library dependencies, this is it.
ar8035_start.S: Your Own _start
When you pass -nostartfiles to GCC, you lose crt0.o—the C runtime startup that normally sets up the stack, calls constructors, and eventually invokes main(). You’re responsible for everything. My replacement is 28 lines of MIPS assembly:
    .set noreorder
    .set nomips16
    .globl _start
    .type _start, @function
_start:
    lui   $gp, %hi(_gp)
    addiu $gp, $gp, %lo(_gp)
    li    $t0, ~7
    and   $sp, $sp, $t0        # align stack to 8 bytes
    la    $t9, ar8035_main
    jalr  $t9
    nop
    move  $a0, $v0             # return value from ar8035_main
    li    $v0, 4001            # sys_exit
    syscall
MIPS ASM Cheatsheet

| Instruction | What it does | C equivalent |
|---|---|---|
| lui rd, imm | Load imm into top 16 bits of rd | rd = imm << 16 |
| addiu rd, rs, imm | Add sign-extended immediate | rd = rs + (short)imm |
| li rd, imm | Load immediate (pseudo-instruction, assembler picks lui/ori or addiu) | rd = imm |
| la rd, label | Load address of a label (pseudo-instruction, assembler emits lui/addiu pair) | rd = &label |
| and rd, rs, rt | Bitwise AND | rd = rs & rt |
| move rd, rs | Copy register (pseudo-instruction, actually addu rd, rs, $zero) | rd = rs |
| jalr rs | Jump and link register: jump to address in rs, save return address in $ra | $ra = PC+8; goto *rs |
| syscall | Trigger a kernel syscall. $v0 = syscall number, $a0-$a3 = arguments | syscall($v0, $a0, ...) |
| nop | No-op (fills branch delay slot after jalr) | (nothing) |
Registers: $gp = global pointer (for accessing globals). $sp = stack pointer. $t0-$t9 = temporaries. $a0-$a3 = function/syscall arguments. $v0 = return value / syscall number. $t9 = used for function calls in PIC code (convention). $ra = return address.
.set noreorder tells the assembler to not reorder instructions around branch delay slots—I’ll manage those myself, thanks. .set nomips16 prevents the assembler from emitting compressed MIPS16 instructions. The lui/addiu pair loads the global pointer, which GCC’s generated code needs for accessing global variables. The stack alignment mask (~7) forces $sp to an 8-byte boundary—the MIPS O32 ABI requires this and the kernel’s initial stack might not comply.
After ar8035_main() returns, its return value sits in $v0. We move it to $a0 (first syscall argument), load syscall number 4001 (sys_exit), and syscall. That’s it. The program exits with whatever status code the C function returned. No atexit handlers, no stdio flushing, no destructors. Just gone.
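If the ~7 mask looks like magic, here is the same arithmetic in C. This is only a sketch for illustration: align_down_8 is a hypothetical name, not something that exists in ar8035_start.S.

```c
#include <stdint.h>

/* Hypothetical helper mirroring `li $t0, ~7` / `and $sp, $sp, $t0`.
 * Clearing the low three bits rounds an address DOWN to a multiple of 8.
 * Down is the safe direction because the MIPS stack grows toward lower
 * addresses, so the aligned pointer never lands on data above it. */
static uintptr_t align_down_8(uintptr_t sp)
{
    return sp & ~(uintptr_t)7;
}
```

An already-aligned value passes through unchanged, so applying the mask unconditionally costs nothing when the kernel did hand over an aligned stack.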
Raw Syscalls: The MIPS O32 Way
Here’s where it gets interesting. x86 Linux syscalls return negative values for errors (e.g., -ENOENT). MIPS O32 is completely different. The kernel signals errors by setting $a3 to 1 and putting the positive errno value in $v0. If $a3 is 0, $v0 is the success return value. Every single syscall wrapper has to check $a3.
But first, there’s a nastier problem. GCC has no idea that a syscall instruction clobbers temporary registers. The MIPS O32 calling convention says $t0-$t9 are caller-saved, but syscall isn’t a “call” from GCC’s perspective—it’s inline assembly. If GCC has a live value in $t4 across a syscall, the kernel will happily overwrite it and GCC will use the corrupted value afterward. The fix:
#define SYSCALL_CLOBBERS \
"$t0","$t1","$t2","$t3","$t4","$t5","$t6","$t7","$t8","$t9","memory"
Every inline asm volatile("syscall") declares all temporaries as clobbers. Without this macro, the program works fine at -O0 (GCC spills everything to the stack) and silently corrupts data at -O2 (GCC keeps values in registers). Ask me how I found that one.
The syscall wrappers all follow the same pattern. Here’s sys_write:
static long sys_write(int fd, const void *buf, unsigned long len) {
    register long r __asm__("v0") = 4004;        // pin r to $v0, set to write's syscall number
    register long a0 __asm__("a0") = fd;         // pin a0 to $a0, first argument (file descriptor)
    register long a1 __asm__("a1") = (long)buf;  // second arg (pointer to data)
    register long a2 __asm__("a2") = (long)len;  // third arg (number of bytes)
    register long a3 __asm__("a3");              // will hold the error flag after syscall
    asm volatile("syscall"
        : "+r"(r), "=r"(a3)          // $v0 is read+write (in: syscall#, out: return val), $a3 is write-only
        : "r"(a0), "r"(a1), "r"(a2)  // $a0-$a2 are read-only inputs
        : SYSCALL_CLOBBERS);         // tell GCC the kernel trashes $t0-$t9
    return a3 ? -r : r;              // if $a3=1 (error), return -errno; else return the result
}
The register long r __asm__("v0") syntax pins a C variable to a specific MIPS register. The constraints tell GCC that $v0 is both input (syscall number) and output (return value), $a3 is output-only (error flag), and $a0-$a2 are inputs. sys_socket, sys_ioctl, and sys_exit follow the same pattern with different syscall numbers. Four syscalls total—that’s all this program needs.
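The last line of sys_write is the whole O32 error convention in miniature. Pulled out into a plain function it looks like the sketch below; o32_syscall_result is a hypothetical name used only here, and the real wrappers do this inline.

```c
/* Hypothetical helper: fold the MIPS O32 (v0, a3) pair into the
 * "negative errno" style that callers on other architectures expect.
 *   a3 == 0 -> v0 is the successful return value
 *   a3 != 0 -> v0 is a positive errno, so negate it */
static long o32_syscall_result(long v0, long a3)
{
    return a3 ? -v0 : v0;
}
```

With this folding, a caller can keep writing the familiar `if (ret < 0)` check, which is exactly what mdio_read does with sys_ioctl.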
Helpers: Everything from Scratch
No libc means no memset, no strcpy, no printf. I wrote minimal replacements for everything:
- memset_z: fills N bytes with zeros. One function, one use case, no reason to generalize.
- str_copy: copies a string into a buffer. Not strcpy; I named it differently so I’d never confuse it with the real thing.
- print / println: wrappers around sys_write(1, ...) with a manual strlen. println appends a newline. Riveting stuff.
- print_hex16: prints a 16-bit value as 0xNNNN. Nibble extraction loop, lookup into "0123456789abcdef".
- print_dec: prints an unsigned integer in decimal. Divides by 10 in a loop, stores digits in reverse, then prints. The kind of function you write in CS 101, except this one runs on a real MIPS CPU. (Integer division is still handled by the CPU’s integer hardware; -msoft-float only changes how floating-point code is generated.)
All of these exist because I needed diagnostic output. When you’re debugging a PHY register fix over a serial console with no other tools, being able to print “PHY ID: 0x004d 0xd072” is the difference between knowing what’s happening and staring at silence.
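For a feel of what the two number printers involve, here is a rough sketch of just the formatting half. It is not the code from ar8035.c: format_hex16 and format_dec are hypothetical names, and they write into a caller-supplied buffer instead of calling sys_write so they can be tried on any machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of print_hex16's formatting step: writes "0xNNNN"
 * plus a terminating NUL. out must hold at least 7 bytes. */
static void format_hex16(uint16_t value, char *out)
{
    static const char digits[] = "0123456789abcdef";
    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 4; i++) {
        /* Most significant nibble first: shifts of 12, 8, 4, 0. */
        out[2 + i] = digits[(value >> (12 - 4 * i)) & 0xf];
    }
    out[6] = '\0';
}

/* Hypothetical sketch of print_dec's formatting step: returns the number
 * of digits written. out must hold at least 11 bytes (a 32-bit value has
 * at most 10 decimal digits, plus the NUL). */
static size_t format_dec(uint32_t value, char *out)
{
    char tmp[10];
    size_t n = 0;
    do {                        /* do/while so that 0 prints as "0" */
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];    /* digits were produced in reverse */
    out[n] = '\0';
    return n;
}
```

Feeding format_hex16 the two expected PHY ID words produces the strings "0x004d" and "0xd072", matching the diagnostic line quoted above.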
MDIO Access: Talking to the PHY
The AR8035 PHY’s registers are accessed through MDIO—Management Data Input/Output, a two-wire serial bus between the MAC and PHY. Linux exposes MDIO through socket ioctls on any AF_INET SOCK_DGRAM socket. You don’t actually send any packets—you just need a socket file descriptor to hang the ioctl off of. Weird, but that’s the Linux networking subsystem for you.
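static int mdio_read(int fd, int phy, int reg) {
    /* ifr is a struct ifreq defined elsewhere in the file (not shown here).
       The MII request is overlaid on the ifreq's ifr_data storage. */
    struct mii_ioctl_data *mii = (void *)&ifr.ifr_data;
    mii->phy_id = phy;
    mii->reg_num = reg;
    if (sys_ioctl(fd, SIOCGMIIREG, (long)&ifr) < 0) return -1;
    return mii->val_out;  /* the kernel fills in the register value */
}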
SIOCGMIIREG reads a PHY register. SIOCSMIIREG writes one. But the AR8035 has more registers than the standard 32 that MDIO can address directly. The extended registers are accessed through an indirect debug mechanism—write the debug register address to MDIO register 0x1D, then read or write the value through MDIO register 0x1E:
static int dbg_read(int fd, int phy, int dbg_reg) {
    if (mdio_write(fd, phy, 0x1D, dbg_reg) < 0) return -1;
    return mdio_read(fd, phy, 0x1E);
}
dbg_write and dbg_mask (read-modify-write) follow the same pattern. The debug register interface is documented in the AR8035 datasheet but good luck finding it—it’s buried in a section about “vendor-specific” registers with no clear cross-reference from the main register table.
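Since dbg_mask isn’t shown, here is a guess at its general shape, run against a fake PHY so it works without hardware. Everything below is an assumption made for illustration: struct fake_phy, the fake_mdio_* functions, and the dbg_mask_sketch signature are hypothetical, and the real code goes through sys_ioctl instead.

```c
#include <stdint.h>

/* Stand-in for the real bus: 32 directly addressable MDIO registers plus a
 * small debug register file reached through 0x1D (address) and 0x1E (data). */
struct fake_phy {
    uint16_t mdio[32];
    uint16_t dbg[64];
};

static int fake_mdio_write(struct fake_phy *p, int reg, int val)
{
    if (reg < 0 || reg > 31) return -1;
    if (reg == 0x1E)
        p->dbg[p->mdio[0x1D] & 63] = (uint16_t)val;  /* data port */
    else
        p->mdio[reg] = (uint16_t)val;
    return 0;
}

static int fake_mdio_read(struct fake_phy *p, int reg)
{
    if (reg < 0 || reg > 31) return -1;
    if (reg == 0x1E) return p->dbg[p->mdio[0x1D] & 63];
    return p->mdio[reg];
}

/* Read-modify-write of one debug register: clear the bits in `clear`,
 * then set the bits in `set`. Returns 0 on success, -1 on failure. */
static int dbg_mask_sketch(struct fake_phy *p, int dbg_reg, int clear, int set)
{
    if (fake_mdio_write(p, 0x1D, dbg_reg) < 0) return -1;
    int val = fake_mdio_read(p, 0x1E);
    if (val < 0) return -1;
    val = (val & ~clear) | set;
    /* Re-select the address before writing so the sketch does not depend
     * on how the address register behaves between accesses. */
    if (fake_mdio_write(p, 0x1D, dbg_reg) < 0) return -1;
    return fake_mdio_write(p, 0x1E, val);
}
```

Taking separate clear and set masks means one helper covers both kinds of change: `dbg_mask_sketch(&phy, 0x0B, 0x8000, 0)` clears bit 15, and `dbg_mask_sketch(&phy, 0x00, 0, 0x8000)` sets it.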
ar8035_main(): The Actual Fix
The main function does six things:
1. Creates an AF_INET SOCK_DGRAM socket (just for the ioctl, remember).
2. Queries the PHY address via SIOCGMIIPHY. It’s 3 on this board.
3. Reads PHY ID registers 0x02 and 0x03. Expected values: 0x004d and 0xd072. That’s the Qualcomm OUI and the AR8035 model number. If they don’t match, we’re talking to the wrong chip and the program aborts immediately.
4. Disables hibernation: debug register 0x0B, clear bit 15 (PS_HIB_EN). This was defensive. Hibernation was already disabled on my board, but some AR8035 revisions default to hibernation enabled, which causes the PHY to power down the link after inactivity. Better safe than debugging a phantom link-drop at 3am.
5. Enables RGMII RX clock delay: debug register 0x00, set bit 15. This is THE fix. This one bit adds the 2 nanosecond delay on the RX clock that the MAC needs to sample data correctly. Without it, every received frame has FCS errors and rx_packets stays at zero forever.
6. Reads BMSR (register 0x01) and prints link status and capabilities for diagnostics.
Step 5 is the entire reason this program exists. One register, one bit, 2 nanoseconds. Everything else is infrastructure to get to that single bit flip.
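The two checks that bracket the fix, the ID comparison and the BMSR decode, are plain bit tests. A minimal sketch follows. The function and macro names are hypothetical. The ID values are the ones quoted above, and the BMSR bit positions are the standard MII ones (link status in bit 2, auto-negotiation complete in bit 5).

```c
#include <stdint.h>

/* PHY ID words the post expects in registers 0x02 and 0x03. */
#define AR8035_ID1 0x004d
#define AR8035_ID2 0xd072

/* Standard MII BMSR (register 0x01) status bits. */
#define BMSR_LINK_UP       0x0004
#define BMSR_ANEG_COMPLETE 0x0020

/* Hypothetical helper: nonzero only if both ID words match the AR8035. */
static int is_ar8035(uint16_t id1, uint16_t id2)
{
    return id1 == AR8035_ID1 && id2 == AR8035_ID2;
}

/* Hypothetical helpers decoding a raw BMSR value. The link bit latches
 * low, so a driver that wants the current state usually reads twice. */
static int link_is_up(uint16_t bmsr)
{
    return (bmsr & BMSR_LINK_UP) != 0;
}

static int aneg_complete(uint16_t bmsr)
{
    return (bmsr & BMSR_ANEG_COMPLETE) != 0;
}
```

Refusing to continue when is_ar8035 fails matters, because debug register 0x00 bit 15 means something specific on this chip and there is no reason to expect it to mean the same thing on another PHY.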
The Makefile: Paranoid Compiler Flags
CC = mips-linux-gnu-gcc
CFLAGS = -O2 -msoft-float -mno-abicalls -fno-pic -Wall
LDFLAGS = -nostdlib -nostartfiles -static -e _start
Let me explain every flag because each one is load-bearing:
- -O2: optimizations on. -O0 produces code that’s 3x larger and slower. We want small.
- -msoft-float: the AR9344 SoC has no floating point unit. Any float instruction (add.s, lwc1, anything touching the FPU) is an illegal instruction trap. The kernel would normally emulate these in software, but we’re not doing any floating point anyway so just tell the compiler not to emit them.
- -mno-abicalls -fno-pic: no position-independent code, no $gp-relative addressing for function calls. We’re a static binary loaded at a fixed address. PIC overhead is wasted code.
- -nostdlib -nostartfiles: no libc, no crt0. We provide everything.
- -static: no dynamic linking. There’s no ld.so in failsafe mode anyway.
- -e _start: entry point is our assembly _start, not the default _start from crt0.
The Makefile also has a Docker fallback for cross-compilation—if mips-linux-gnu-gcc isn’t installed on the host, it spins up a Debian container with the MIPS cross toolchain. Because not everyone has a MIPS cross-compiler just lying around. (I do now. I’ve used it way more than I ever expected.)
The final binary is 5,592 bytes. A socket, two MDIO register writes, some diagnostic output, a custom entry point, and raw syscalls. No libc, no dynamic linking, no dependencies. Just a static MIPS ELF that fixes a 2-nanosecond timing problem on an Ethernet PHY. The most overengineered two-register-write program I’ve ever written, and I’m unreasonably proud of it.