12 - MR18 Deep Dive: Launch and Failsafe

Launching the kernel and catching failsafe on the way down

Started from the bottom now we here, started from the bottom now my whole team here

Post 10 ended with a verified binary in clean DRAM. Post 5 gave the high-level failsafe saga. This post covers the actual code: the launch trampoline, the regression that taught me to stop touching the CPU, and two competing strategies for triggering failsafe mode.

Phase 4: The Launch Trampoline

Verification done. load_and_run() writes the launch trampoline to TRAMPOLINE_ADDR and lets go:

LAUNCH_TRAMPOLINE = [
    0x0A018000,  # j 0xA0060000   -- jump to lzma-loader entry
    0x00000000,  # nop             -- branch delay slot
    0x00000000,  # nop
    0x00000000,  # nop
]

for i, word in enumerate(LAUNCH_TRAMPOLINE):
    addr = int(TRAMPOLINE_ADDR, 16) + i * 4
    ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")

rb = ocd.cmd(f"mdw {TRAMPOLINE_ADDR}")
# ... verify word matches 0x0A018000 ...
ocd.cmd(f"resume {TRAMPOLINE_ADDR}")

Same pattern as every other trampoline: write, verify, resume. But no wait_halt, no PC check. The CPU jumps to 0xA0060000, the lzma-loader starts decompressing, and we never talk to EJTAG again. That resume is the point of no return—the CPU belongs to Linux now.
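For sanity-checking trampoline words offline, here is a minimal sketch of the standard MIPS32 J-type encoding. encode_j and decode_j are hypothetical helper names, not part of the loader script, and the sketch assumes TRAMPOLINE_ADDR sits in the same 256 MB KSEG1 region as the jump target. By this textbook layout, a jump to 0xA0060000 encodes as 0x08018000, while 0x0A018000 has index bit 25 set and decodes to 0xA8060000. Whether that address aliases back onto 0xA0060000 depends on how the DRAM controller wraps above the installed RAM, which isn't established here, so it's worth running your own constant through the decoder.

```python
J_OPCODE = 0x02  # MIPS32 "j" major opcode, stored in bits 31..26


def encode_j(target: int) -> int:
    """Encode `j target`. Only bits 27..2 of the target fit in the word."""
    if target & 0x3:
        raise ValueError("jump target must be word-aligned")
    return (J_OPCODE << 26) | ((target >> 2) & 0x03FFFFFF)


def decode_j(word: int, pc: int) -> int:
    """Return the address that a `j` word located at `pc` jumps to."""
    if (word >> 26) != J_OPCODE:
        raise ValueError("not a j instruction")
    # The top four address bits come from the delay slot's address (pc + 4).
    region = (pc + 4) & 0xF0000000
    return region | ((word & 0x03FFFFFF) << 2)
```

Calling decode_j(word, pc) with the word you are about to write and the trampoline's own address shows where the CPU will actually land.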

Phase 4b: The Regression That Taught Me to Let Go

The lzma-loader doesn’t zero a0-a3 before jumping to 0x80060000, so the kernel inherits register garbage from our JTAG session. My plan: HW breakpoint at the entry, halt after decompression, sanitize the registers, resume.

# THE OLD (BROKEN) APPROACH -- DO NOT USE
# 1. HW breakpoint at kernel entry
# 2. Wait ~15s for lzma decompression, halt at 0x80060000
# 3. Sanitize a0-a3 (8 register r/w while halted at HW BP)
# 4. Set second HW BP at proc_dostring, resume on running CPU

Worked on runs 1 through 3. Run 4: zero UART output. Run 5: same. The combined PRACC activity—8 register reads/writes at a HW breakpoint, then a breakpoint set plus resume on a running CPU—corrupted the EJTAG state machine.

Run 3 was the proof. I accidentally skipped Phase 4b (forgot to uncomment after a refactor). Full UART output. The kernel hit a TLBL exception from the stale a1 pointer, printed a stack trace, recovered on its own. The register residue didn’t matter at all. Fix: resume from the trampoline, start the UART thread, stop touching the CPU.
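The shape of that fix, as a minimal sketch: issue one final resume, then hand everything to a background thread and never call into JTAG again. launch_and_detach, resume_cpu, and uart_worker are hypothetical names for illustration; the real body of load_and_run() isn't shown in this post.

```python
import threading


def launch_and_detach(resume_cpu, uart_worker, worker_args=()):
    """Resume the CPU once, then leave the rest to a UART thread.

    resume_cpu  -- zero-argument callable that issues the final JTAG resume
    uart_worker -- callable run in the background; it receives worker_args
                   followed by (stop_ev, failsafe_ev)
    """
    stop_ev = threading.Event()      # set by the caller to stop the worker
    failsafe_ev = threading.Event()  # set by the worker once the shell is up

    resume_cpu()  # last JTAG command of the session

    thread = threading.Thread(
        target=uart_worker,
        args=(*worker_args, stop_ev, failsafe_ev),
        daemon=True,
    )
    thread.start()
    return thread, stop_ev, failsafe_ev
```

The caller can then block on failsafe_ev.wait(timeout=...) and set stop_ev when it's done, without ever halting the target.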

_uart_en_fn(): The Thread That Does Everything

After the trampoline fires, load_and_run() spawns _uart_en_fn() as a background thread—EN pin timing, UART reading, failsafe detection, and shell setup, all on one serial port:

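import time

import serial  # pyserial

def _uart_en_fn(port, baud, en_delay, en_hold, stop_ev, failsafe_ev):
    ser = serial.Serial(port, baud, timeout=0.05)
    t0 = time.time()
    en_asserted = False
    en_released = False
    buf = ""

    def is_shell_prompt(text):
        # Explicit parentheses: "and" binds tighter than "or"
        return "/ #" in text or ("failsafe" in text.lower() and "#" in text)

    while not stop_ev.is_set():
        elapsed = time.time() - t0

        if not en_asserted and elapsed >= en_delay:       # t=2s
            ser.rts = True
            en_asserted = True
        if en_asserted and not en_released and elapsed >= en_delay + en_hold:  # t=42s
            ser.rts = False
            en_released = True

        chunk = ser.read(ser.in_waiting or 1)
        if not chunk:
            continue
        buf += chunk.decode(errors="replace")

        prompt_seen = False
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            print(f"[UART {time.time()-t0:6.1f}s] {line}")

            if "press the [f] key" in line.lower():
                ser.write(b"f\n")
            prompt_seen = prompt_seen or is_shell_prompt(line)

        # A shell prompt has no trailing newline, so also check the partial line in buf.
        # Run the setup once: the echoed commands contain the prompt and would retrigger it.
        if (prompt_seen or is_shell_prompt(buf)) and not failsafe_ev.is_set():
            ser.write(b"while true; do echo 1 > /dev/watchdog; sleep 5; done &\n")
            time.sleep(0.3)
            ser.write(b"ifconfig eth0 192.168.1.1 netmask 255.255.255.0 up\n")
            time.sleep(0.3)
            ser.write(b"telnetd -l /bin/sh &\n")
            failsafe_ev.set()

    ser.close()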

EN asserts at t=2s and holds LOW for 40 seconds, way wider than the preinit window at t=18-28s. The UART loop pattern-matches two triggers: “press the [f] key” -> send f\n, and the failsafe shell prompt -> fire the watchdog kicker, configure eth0, start telnetd. The watchdog kicker matters: the QCA9557 reboots at ~90 seconds if nothing feeds /dev/watchdog, and failsafe doesn’t always start the daemon.
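The matching and timing logic can be pulled out into pure functions, which makes it testable without a serial port attached. This is a sketch only: classify_uart_text and en_covers_window are hypothetical helpers, not functions from the script, and the shell_ready flag is an assumption about how you would stop the setup from firing twice (the echoed setup commands contain the prompt themselves).

```python
def classify_uart_text(text: str, shell_ready: bool = False):
    """Map a piece of UART output to an action name, or None."""
    lowered = text.lower()
    if "press the [f] key" in lowered:
        return "send_f"
    is_prompt = "/ #" in text or ("failsafe" in lowered and "#" in text)
    if is_prompt and not shell_ready:
        return "setup_shell"
    return None


def en_covers_window(en_delay, en_hold, window_start, window_end):
    """True if EN is held for the whole [window_start, window_end] span (seconds)."""
    return en_delay <= window_start and en_delay + en_hold >= window_end
```

With the numbers from this post, en_covers_window(2, 40, 18, 28) holds with 16 seconds of margin before the preinit window and 14 after.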

trigger_failsafe_gpio(): The Hard Way (That Failed)

Before the EN pin, I tried hammering GPIO17 LOW through JTAG PRACC. In C pseudocode, each cycle of this loop does:

// pause the CPU so we can touch memory-mapped registers
cpu_halt();

// read the GPIO output-enable register, set bit 17 to make GPIO17 an output
*(volatile int *)0xb8040000 |= (1 << 17);

// write to the GPIO clear register to drive GPIO17 LOW (= "button pressed")
*(volatile int *)0xb804000c = (1 << 17);

// let the CPU keep running so the kernel can advance
cpu_resume();
sleep(1.5);  // repeat every 1.5 seconds for 25 seconds

The actual Python:

def trigger_failsafe_gpio(ocd: OCD):
    GPIO_OE   = 0xb8040000
    GPIO_SET  = 0xb8040008
    GPIO_CLR  = 0xb804000c
    BIT17     = 1 << 17

    for cycle in range(17):          # 25 seconds at ~1.5s per cycle
        ocd.cmd("halt", timeout=2.0)
        ocd.cmd("wait_halt 500", timeout=2.0)

        # Diagnostic snapshot: what are the GPIOs doing right now?
        before_oe  = ocd.cmd(f"mdw 0x{GPIO_OE:08x}")
        before_out = ocd.cmd(f"mdw 0x{GPIO_SET:08x}")

        # Set GPIO17 as output
        oe_val = parse_mdw(before_oe) | BIT17
        ocd.cmd(f"mww 0x{GPIO_OE:08x} 0x{oe_val:08x}")

        # Drive it LOW -- tried both approaches
        ocd.cmd(f"mww 0x{GPIO_CLR:08x} 0x{BIT17:08x}")   # open-drain style
        ocd.cmd(f"mww 0x{GPIO_SET:08x} 0x00000000")       # push-pull style

        # Diagnostic snapshot after
        after_oe  = ocd.cmd(f"mdw 0x{GPIO_OE:08x}")
        after_out = ocd.cmd(f"mdw 0x{GPIO_SET:08x}")

        ocd.cmd("resume")
        time.sleep(1.5)

Every 1.5 seconds: halt CPU, snapshot GPIOs, drive GPIO17 LOW, snapshot again, resume. The halt is mandatory—mdw/mww silently fail on a running target. Tried both GPIO_CLR and GPIO_SET (open-drain vs push-pull uncertainty). Register writes took effect but GPIO17 stayed HIGH at the pad—the reset supervisor IC sources 10-50 mA, the SoC manages 2-4 mA. Squirt gun versus fire hose. 17 cycles, absolutely fucking nothing.
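parse_mdw() is called in the loop but not shown. A minimal sketch of what it might look like, assuming OpenOCD answers mdw with its usual "0xADDRESS: WORD" text; the regex and the error handling are my guesses at a reasonable shape, not the script's actual helper.

```python
import re

# Matches the first 32-bit word in a reply such as "0xb8040000: 00020000"
_MDW_RE = re.compile(r"0x[0-9a-fA-F]+:\s+([0-9a-fA-F]{8})")


def parse_mdw(response: str) -> int:
    """Extract the first data word from an OpenOCD `mdw` reply."""
    match = _MDW_RE.search(response)
    if match is None:
        # Fail loudly rather than returning 0 and writing a bogus OE mask
        raise ValueError(f"unexpected mdw response: {response!r}")
    return int(match.group(1), 16)
```

Raising on an unparseable reply matters here, because the parsed value is ORed with BIT17 and written straight back to the register.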

trigger_failsafe_en(): The Simple Way (That Worked)

After the GPIO disaster I wired the ESP-Prog’s EN pin to the reset button pad and wrote this:

def trigger_failsafe_en(port=ESPPROG_UART, baud=115200):
    ser = serial.Serial(port, baud)
    time.sleep(FAILSAFE_EN_DELAY)     # 2 seconds
    ser.rts = True                     # NPN conducts -> GPIO17 LOW
    time.sleep(FAILSAFE_EN_HOLD)      # 40 seconds
    ser.rts = False                    # release
    ser.close()

Seven lines. The NPN transistor connects downstream of the supervisor at the button pad, where it actually wins the current fight. GPIO17 stays LOW for 40 seconds. Two days on trigger_failsafe_gpio(). Twenty minutes on this. Worked first try.
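One hardening worth considering: if the script dies mid-hold, RTS can stay asserted and the button stays "pressed". A sketch of a wrapper that always releases the line, assuming any object with a writable rts attribute (a pyserial Serial works); rts_asserted and hold_en are hypothetical names, and the injectable sleep exists only so the timing can be tested.

```python
import time
from contextlib import contextmanager


@contextmanager
def rts_asserted(ser):
    """Assert RTS on entry and always release it on exit, even on error."""
    ser.rts = True
    try:
        yield ser
    finally:
        ser.rts = False


def hold_en(ser, delay_s, hold_s, sleep=time.sleep):
    """Wait delay_s, hold RTS asserted for hold_s, then release."""
    sleep(delay_s)
    with rts_asserted(ser):
        sleep(hold_s)
```

It costs more than seven lines, so it's a trade: the original is easier to read, this one survives a Ctrl-C.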

The Narrative Arc

Every clever approach failed. Every obvious one worked. I don’t regret any of it—_uart_en_fn() only exists because I tried all the dumb approaches first and catalogued exactly what goes wrong. You can’t learn that from a datasheet. You learn it at 3am wondering why GPIO17 won’t go LOW.