10 - MR18 Deep Dive: Loading and Flushing

What load_and_run() actually does (phases 0-2)

I been through the fire and the rain, and I came out the other side

Posts 3 and 4 gave the high-level overview of the load pipeline—flush, load, flush again, verify, launch. This deep-dive series rips open the actual Python and walks through the code. We’re starting with phases 0 through 2 of load_and_run(), which covers everything up to but not including the XOR verification passes.

Fair warning: this one is dense. If you’ve been skimming, now is when the exam material starts.

Setup: Reading the Binary and Precomputing XOR32

The first thing load_and_run() does is dead simple—read the initramfs binary and compute the expected XOR32 checksum we’ll use later for verification:

def load_and_run(ocd: OCD, _out: dict | None = None):
    with open(INITRAMFS, "rb") as f:
        file_data = f.read()
    sz = len(file_data)

    expected_xor = compute_xor32(file_data)
    print(f"[*] Expected XOR32: 0x{expected_xor:08x}  ({sz // 1024} KB)")

compute_xor32() is about as minimal as a checksum function can be. It XORs together every 32-bit big-endian word in the binary. The length is rounded down to a word boundary, so up to three trailing bytes are ignored rather than read past the end:

def compute_xor32(data: bytes) -> int:
    """XOR all 32-bit big-endian words (truncated to word boundary)."""
    xor = 0
    for i in range(0, (len(data) // 4) * 4, 4):
        xor ^= int.from_bytes(data[i:i+4], 'big')  # grab 4 bytes, interpret as big-endian 32-bit int
    return xor & 0xFFFFFFFF

Big-endian because that’s what the MIPS CPU uses when it does lw (load word) instructions, and it’s what load_image writes over PRACC. If we computed the checksum in little-endian on the host and the CPU computed it in big-endian on the target, they’d never match. Ask me how I know.
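To make the endianness point concrete, here is a small host-side sketch. `xor32` is a hypothetical variant of compute_xor32() with a `byteorder` parameter added purely for illustration; it is not part of the original script. Because XOR works on each byte lane independently, the little-endian result is just the byte-swapped big-endian result, so the two only agree when the checksum happens to read the same in both byte orders.

```python
def xor32(data: bytes, byteorder: str = "big") -> int:
    """XOR all complete 32-bit words; trailing 1-3 bytes are ignored."""
    acc = 0
    for i in range(0, len(data) - len(data) % 4, 4):
        acc ^= int.from_bytes(data[i:i + 4], byteorder)
    return acc


def byteswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


# Two full words plus one stray byte that the truncation drops.
sample = bytes.fromhex("3c088000" "25080020" "ff")

big = xor32(sample, "big")
little = xor32(sample, "little")

print(f"big-endian    XOR32: 0x{big:08x}")
print(f"little-endian XOR32: 0x{little:08x}")
print(f"byte-swapped big   : 0x{byteswap32(big):08x}")
```

The takeaway is that a mismatch caused by endianness looks like a byte-reversed checksum, which is a useful thing to recognise when the host and target numbers disagree.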

Phase 0: D-Cache Flush BEFORE Loading

This is the critical one. Post 4 explained why: the Nandloader left dirty Cisco D-cache lines at physical 0x5FC00+, and if we don’t evict them before writing our binary, they’ll be written back on top of our data later. If you need the full cache coherency explanation, go read post 4. I’m not re-explaining KSEG0 vs KSEG1 again, because typing that out once was enough pain for a lifetime.
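For anyone who skipped post 4, the failure mode can be sketched with a toy model. Everything here (the `ToyCache` class, the dict standing in for DRAM, the byte strings) is invented for illustration and models a single write-back line, not the real cache hardware.

```python
class ToyCache:
    """A write-back cache holding whole lines keyed by physical address."""

    def __init__(self, dram: dict):
        self.dram = dram
        self.dirty = {}  # physical address -> data waiting to be written back

    def cached_write(self, addr: int, data: bytes) -> None:
        # A cached (KSEG0-style) store lands in the cache, not in DRAM.
        self.dirty[addr] = data

    def writeback_invalidate_all(self) -> None:
        # Flush: push every dirty line out to DRAM, then forget it.
        for addr, data in self.dirty.items():
            self.dram[addr] = data
        self.dirty.clear()


def uncached_write(dram: dict, addr: int, data: bytes) -> None:
    # An uncached (KSEG1-style) store goes straight to DRAM.
    dram[addr] = data


def scenario(flush_first: bool) -> bytes:
    dram = {}
    cache = ToyCache(dram)
    cache.cached_write(0x5FC00, b"stale bootloader data")  # left dirty
    if flush_first:
        cache.writeback_invalidate_all()          # the Phase 0 flush
    uncached_write(dram, 0x5FC00, b"our image")   # the image load
    cache.writeback_invalidate_all()              # any later eviction
    return dram[0x5FC00]


print("without pre-flush:", scenario(flush_first=False))
print("with pre-flush:   ", scenario(flush_first=True))
```

In the toy model, skipping the first flush means the stale line wins the race to DRAM, which is the whole reason Phase 0 exists.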

Here’s the trampoline that does the actual flush: 10 words of hand-encoded MIPS that live at TRAMPOLINE_ADDR (0xa0800000, uncached KSEG1). In C, this is roughly what the CPU does (the two cache helpers are placeholder names, not real compiler intrinsics):

unsigned int *addr = (unsigned int *)0x80000000;
unsigned int *end  = (unsigned int *)0x80008000;
while (addr < end) {
    __invalidate_icache_line(addr);
    __writeback_invalidate_dcache_line(addr);
    addr += 8;  // +32 bytes = one cache line
}
__debugbreak();

The actual MIPS:

FLUSH_TRAMPOLINE = [
    0x3C088000,  # lui  t0, 0x8000        (set t0 to 0x80000000, start of KSEG0)
    0x3C098000,  # lui  t1, 0x8000        (set t1 to 0x80000000 temporarily)
    0x35298000,  # ori  t1, t1, 0x8000    (t1 = 0x80008000, 32KB past start = loop end)
    0xBD000000,  # cache 0x00, 0(t0)      (invalidate the I-cache line at this index)
    0xBD010000,  # cache 0x01, 0(t0)      (write back + invalidate the D-cache line at this index)
    0x25080020,  # addiu t0, t0, 32       (advance t0 by one cache line = 32 bytes)
    0x1509FFFC,  # bne  t0, t1, -4        (if t0 != t1, jump back to the cache instructions)
    0x00000000,  # nop                     (delay slot)
    0x7000003F,  # sdbbp                   (trigger debug breakpoint, OpenOCD regains control)
    0x00000000,  # nop
]
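Hand-encoded words are easy to get wrong, so it can be worth cross-checking them against the MIPS32 field layout. The sketch below is not from the original script: the helper names (`i_type`, `lui`, `ori`, and so on) are made up here, and the register numbers follow the usual o32 convention (t0 = $8, t1 = $9).

```python
T0, T1 = 8, 9  # o32 register numbers for t0 and t1


def i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Pack a MIPS32 I-type word: opcode | rs | rt | 16-bit immediate."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def lui(rt: int, imm: int) -> int:
    return i_type(0x0F, 0, rt, imm)


def ori(rt: int, rs: int, imm: int) -> int:
    return i_type(0x0D, rs, rt, imm)


def addiu(rt: int, rs: int, imm: int) -> int:
    return i_type(0x09, rs, rt, imm)


def bne(rs: int, rt: int, offset: int) -> int:
    # offset is counted in instructions, relative to the delay slot
    return i_type(0x05, rs, rt, offset)


def cache(op: int, offset: int, base: int) -> int:
    # the 5-bit cache operation sits where rt normally goes
    return i_type(0x2F, base, op, offset)


NOP = 0x00000000
SDBBP = 0x7000003F  # SPECIAL2 opcode, function 0x3F, code field 0

assembled = [
    lui(T0, 0x8000),
    lui(T1, 0x8000),
    ori(T1, T1, 0x8000),
    cache(0x00, 0, T0),
    cache(0x01, 0, T0),
    addiu(T0, T0, 32),
    bne(T0, T1, -4),
    NOP,
    SDBBP,
    NOP,
]

for word in assembled:
    print(f"0x{word:08X}")
```

If the printed words match the table, the encodings are at least self-consistent. The position of SDBBP in the list (index 8, so byte offset 0x20) is also where the halt address checked later comes from.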

MIPS ASM Cheatsheet

Instruction	What it does	C equivalent
`lui rd, imm`	Load `imm` into top 16 bits of `rd`	`rd = imm << 16`
`ori rd, rs, imm`	OR with zero-extended immediate	`rd = rs | imm`
`addiu rd, rs, imm`	Add sign-extended immediate	`rd = rs + (short)imm`
`cache op, off(rs)`	`op=0x00`: invalidate I-cache line. `op=0x01`: write back dirty data + invalidate D-cache line	no C equivalent
`bne rs, rt, off`	Branch if not equal	`if (rs != rt) goto target`
`sdbbp`	Debug breakpoint—halts CPU for JTAG	`__debugbreak()`
`nop`	No-op (fills branch delay slot)	(nothing)

The cache 0x01 instruction is the money shot—Index_WB_Invalidate writes back any dirty line and then invalidates it. The loop walks every cache index across all 4 ways (32KB total), so every single dirty line in the D-cache gets flushed to DRAM. The I-cache gets invalidated too for good measure.
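The loop bounds can be sanity-checked with a little arithmetic. This sketch assumes the geometry stated above (32 KB D-cache, 4 ways, 32-byte lines) and that an index-type cache op picks both the set and the way from the address bits, so one pass over a 32 KB window touches each line once. The constant names are mine, not the script's.

```python
CACHE_SIZE = 32 * 1024   # bytes, as stated in the post
WAYS = 4
LINE_SIZE = 32           # bytes per cache line

START = 0x80000000       # t0 at loop entry
END = 0x80008000         # t1, the loop end

lines_total = CACHE_SIZE // LINE_SIZE      # lines across all ways
sets = lines_total // WAYS                 # lines per way
iterations = (END - START) // LINE_SIZE    # times the loop body runs

print(f"{lines_total} lines total = {sets} sets x {WAYS} ways")
print(f"loop body runs {iterations} times")
```

Under those assumptions the iteration count equals the number of D-cache lines, which is what you want from a full flush.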

Now here’s the pattern for actually running the trampoline. This exact sequence repeats twice in load_and_run() (phases 0 and 2), so pay attention—I’ll point out the repetition after:

# 1) Write every word via mww
for i, word in enumerate(FLUSH_TRAMPOLINE):
    addr = int(TRAMPOLINE_ADDR, 16) + i * 4
    ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")

# 2) Verify every word via mdw (PRACC readback)
flush_bad = 0
for i, word in enumerate(FLUSH_TRAMPOLINE):
    addr = int(TRAMPOLINE_ADDR, 16) + i * 4
    rb = ocd.cmd(f"mdw 0x{addr:08x}", timeout=5.0)
    m = re.search(r':\s+([0-9a-fA-F]{8})', rb)  # OpenOCD returns "addr: deadbeef", grab the hex value
    got = int(m.group(1), 16) if m else None    # parse the hex string into an int (or None if mdw failed)
    if got != word:
        flush_bad += 1
if flush_bad:
    return False

# 3) Resume CPU at trampoline, wait for SDBBP halt
ocd.cmd(f"resume {TRAMPOLINE_ADDR}", timeout=5.0)
ocd.cmd("wait_halt 2000", timeout=5.0)

# 4) Check PC landed where we expect
rpc = ocd.cmd("reg pc", timeout=5.0)
m_pc = re.search(r'0x([0-9a-fA-F]+)', rpc)       # OpenOCD returns "pc: 0xa0800020", grab the address
pc_val = int(m_pc.group(1), 16) if m_pc else None  # parse it so we can compare against expected sdbbp location
if pc_val != flush_sdbbp_pc:   # expects TRAMPOLINE_ADDR + 0x20
    return False

Write, verify, resume, check PC. That’s the pattern. The trampoline is only 10 words—40 bytes—so even with PRACC’s shitty 0.04% error rate, the probability of a bit flip in 10 reads is negligible. But we verify anyway because if even one instruction is wrong the CPU could jump to god knows where instead of hitting sdbbp, and then you’re staring at a hung target with no idea why.
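As a rough back-of-the-envelope check, here is the arithmetic under one loudly stated assumption: that the 0.04% figure is a per-word rate and that errors are independent. The post does not say what the rate is measured per, so treat the number as illustrative only.

```python
def p_any_error(per_word_rate: float, words: int) -> float:
    """Probability that at least one of `words` independent transfers is bad."""
    return 1.0 - (1.0 - per_word_rate) ** words


RATE = 0.0004   # 0.04%, ASSUMED here to be per 32-bit word
WORDS = 10      # size of the flush trampoline

p = p_any_error(RATE, WORDS)
print(f"P(at least one bad word in {WORDS}) = {p:.2%}")
```

Under that assumption the chance of a corrupted trampoline is a fraction of a percent per pass: small, but not zero, which is why a 10-word readback is cheap insurance.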

The PC check at the end (TRAMPOLINE_ADDR + 0x20) confirms the CPU actually executed the full flush loop and landed on the sdbbp instruction. If the PC is anywhere else, something went sideways and we bail immediately.
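The four steps can also be sketched as a single function. The script keeps them inline on purpose, so this is only to show the shape of the pattern in one place. `run_trampoline` and `FakeOCD` are hypothetical names, and the fake's reply strings are an assumption about what the real wrapper returns, chosen so the same regexes apply.

```python
import re


class FakeOCD:
    """Stand-in for the OCD wrapper, with just enough behaviour for a demo."""

    def __init__(self, halt_pc: int):
        self.mem = {}
        self.halt_pc = halt_pc

    def cmd(self, line: str, timeout: float = 5.0) -> str:
        parts = line.split()
        if parts[0] == "mww":
            self.mem[int(parts[1], 16)] = int(parts[2], 16)
            return ""
        if parts[0] == "mdw":
            addr = int(parts[1], 16)
            return f"0x{addr:08x}: {self.mem.get(addr, 0):08x}"
        if parts[0] == "reg":
            return f"pc (/32): 0x{self.halt_pc:08x}"
        return ""  # resume / wait_halt: nothing to simulate


def run_trampoline(ocd, base: int, words: list, expected_pc: int) -> bool:
    # 1) Write every word.
    for i, word in enumerate(words):
        ocd.cmd(f"mww 0x{base + i * 4:08x} 0x{word:08x}")
    # 2) Read every word back and compare.
    for i, word in enumerate(words):
        rb = ocd.cmd(f"mdw 0x{base + i * 4:08x}")
        m = re.search(r":\s+([0-9a-fA-F]{8})", rb)
        if m is None or int(m.group(1), 16) != word:
            return False
    # 3) Resume at the trampoline and wait for the halt.
    ocd.cmd(f"resume 0x{base:08x}")
    ocd.cmd("wait_halt 2000")
    # 4) Confirm the CPU stopped where we expected.
    m_pc = re.search(r"0x([0-9a-fA-F]+)", ocd.cmd("reg pc"))
    return m_pc is not None and int(m_pc.group(1), 16) == expected_pc


BASE = 0xA0800000
WORDS = [0x7000003F, 0x00000000]  # sdbbp; nop

ok = run_trampoline(FakeOCD(halt_pc=BASE), BASE, WORDS, expected_pc=BASE)
wrong_pc = run_trampoline(FakeOCD(halt_pc=BASE + 4), BASE, WORDS, expected_pc=BASE)
print("halted at expected PC:", ok)
print("halted somewhere else:", wrong_pc)
```

The trade-off is the one the post describes later: a helper removes duplication, but the caller then has to report which phase failed.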

Phase 1: load_image

With the D-cache clean, we can safely write our binary to DRAM:

resp = ocd.cmd(f"load_image {INITRAMFS} {LOAD_ADDR} bin", timeout=180.0)

One line. One OpenOCD command. 6.9 MB of initramfs kernel shoveled through PRACC at roughly 97 KB/s. The 180-second timeout is generous—the transfer takes about 70 seconds in practice—but I’d rather wait 3 minutes than have a timeout kill a successful load because USB decided to take a nap.
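The timing claim is easy to check from the two numbers in the paragraph. This sketch assumes binary units (1 MB = 1024 KB); with decimal units the estimate drops by a second or two.

```python
IMAGE_MB = 6.9      # image size from the post
RATE_KB_S = 97      # approximate PRACC throughput from the post
TIMEOUT_S = 180.0   # timeout passed to load_image

transfer_s = IMAGE_MB * 1024 / RATE_KB_S   # assumes 1 MB = 1024 KB
headroom = TIMEOUT_S / transfer_s

print(f"estimated transfer: {transfer_s:.0f} s")
print(f"timeout headroom:   {headroom:.1f}x")
```

That puts the timeout at roughly two and a half times the expected transfer, which lines up with "generous" without being absurd.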

After the load, we re-halt the CPU:

print("[*] Re-halting CPU after load ...")
r = ocd.cmd("halt", timeout=5.0)
ocd.cmd("wait_halt 2000", timeout=5.0)

Why? Because load_image can leave the target in a running state. OpenOCD’s documentation is wonderfully vague about when this happens, so I just always re-halt. Defensive programming, baby.

Phase 2: Post-Load D-Cache Flush (Belt and Suspenders)

Here’s where it gets paranoid. We already flushed the cache before loading. load_image writes through KSEG1 (uncached), so it shouldn’t introduce any new dirty D-cache lines. In theory, the cache is still clean. So why flush again?

Because “shouldn’t” is not “can’t.” During the PRACC write sequence, the debug unit is hammering the memory bus for 70 straight seconds. Speculative instruction fetches or pipeline stalls during PRACC operations could touch KSEG0 addresses and pull data into the D-cache. It’s unlikely—but on a CPU where I’ve already been burned by cache coherency three separate times, I’m not taking chances.

The code is literally the same pattern as Phase 0:

for i, word in enumerate(FLUSH_TRAMPOLINE):
    addr = int(TRAMPOLINE_ADDR, 16) + i * 4
    ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")
# ... verify every word ...
# ... resume at TRAMPOLINE_ADDR ...
# ... wait for halt, check PC ...

Write, verify, resume, check PC. Same trampoline, same address, same verification, same PC check. If you’re reading the source and thinking “wait didn’t I just see this exact block 40 lines ago”—yes, yes you did. I could have refactored it into a helper function but honestly it’s clearer as inline code when you’re debugging at 2am and need to know exactly which flush failed.

After Phase 2 completes:

[+] Post-load D-cache flush done [pass]

At this point, we have 6.9 MB of OpenWrt initramfs sitting in clean DRAM with a fully invalidated D-cache. No stale Cisco lines, no speculative residue, no surprises. The binary is ready for verification (phases 3a-3c, next post) and eventually launch.

The Pattern

If you take one thing away from this post, it’s the trampoline pattern: write -> verify -> resume -> check PC. Every single trampoline execution in load_and_run() follows this exact sequence—the flush trampoline twice (phases 0 and 2), the XOR checksum program (phase 3, next post), and the launch trampoline (phase 4). The only things that change between them are the trampoline contents and the expected halt address.

It’s repetitive, it’s verbose, and it’s reliable as hell. When you’re debugging embedded systems over a transport layer that flips bits 0.04% of the time, paranoia isn’t a personality flaw—it’s engineering.