11 - MR18 Deep Dive: Verification

CPU-executed XOR verification

Trust none of what you hear and less of what you see / I had to verify the data runnin through the memory

Post 10 covered phases 0 through 2—flush, load, flush again. At that point we have 6.9 MB of initramfs sitting in DRAM, and now we need to answer a simple question: is it correct? This post covers phase 3, the XOR verification pipeline, including the approach that didn’t work and the one that does.

The Checksum Program

Post 7 showed the static layout of make_checksum_program(). Here’s how it actually gets used. The function builds a 14-word MIPS program on the fly—words 0 through 3 are LUI/ORI pairs that encode the start and end addresses for the current chunk, and words 4 through 13 are constant. In C, the program the CPU runs is:

unsigned int *ptr = (unsigned int *)start;
unsigned int *end = (unsigned int *)(start + size);
unsigned int xor  = 0;
while (ptr < end) {
    xor ^= *ptr;
    ptr++;
}
*(volatile unsigned int *)RESULT_ADDR = xor;
__debugbreak();

The Python that generates the MIPS encoding:

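def make_checksum_program(start_kseg1, file_size):
    # NOTE: base and RESULT_OFFSET are not defined in this excerpt; together they
    # locate the result slot (TRAMPOLINE_ADDR + 0x40, see run_xor() below).
    end = start_kseg1 + file_size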
    return [
        0x3C080000 | (start_kseg1 >> 16),      # lui  $t0, start_hi   (load upper 16 bits of chunk start into t0)
        0x35080000 | (start_kseg1 & 0xFFFF),   # ori  $t0, $t0, start_lo (fill in lower 16 bits, t0 = chunk start)
        0x3C090000 | (end >> 16),               # lui  $t1, end_hi     (load upper 16 bits of chunk end into t1)
        0x35290000 | (end & 0xFFFF),            # ori  $t1, $t1, end_lo (fill in lower 16 bits, t1 = chunk end)
        0x240A0000,   # addiu $t2, $zero, 0     (zero out t2, this is our XOR accumulator)
        0x8D0B0000,   # lw    $t3, 0($t0)      (read one 32-bit word from memory at t0 into t3)
        0x014B5026,   # xor   $t2, $t2, $t3    (XOR the word into the accumulator: t2 = t2 ^ t3)
        0x25080004,   # addiu $t0, $t0, 4       (advance the pointer by 4 bytes = one word)
        0x1509FFFC,   # bne   $t0, $t1, -4     (if t0 != t1, jump back to the lw instruction)
        0x00000000,   # nop                     (delay slot)
        0x3C0C0000 | (base >> 16),              # lui  $t4, base_hi   (load upper bits of where to store the result)
        (0xAD8A0000) | RESULT_OFFSET,           # sw   $t2, off($t4)  (write the final XOR value to that address)
        0x7000003F,   # sdbbp                   (halt the CPU so OpenOCD can read back the result)
        0x00000000,   # nop
    ]

MIPS ASM Cheatsheet

Instruction	What it does	C equivalent
`lui rd, imm`	Load `imm` into top 16 bits of `rd`	`rd = imm << 16`
`ori rd, rs, imm`	OR with zero-extended immediate	`rd = rs | imm`
`addiu rd, rs, imm`	Add sign-extended immediate	`rd = rs + (short)imm`
`lw rd, off(rs)`	Load 4 bytes from memory	`rd = *(int *)(rs + off)`
`sw rs, off(rd)`	Store 4 bytes to memory	`*(int *)(rd + off) = rs`
`xor rd, rs, rt`	Bitwise XOR	`rd = rs ^ rt`
`bne rs, rt, off`	Branch if not equal	`if (rs != rt) goto target`
`sdbbp`	Debug breakpoint—halts CPU for JTAG	`__debugbreak()`
`nop`	No-op (fills branch delay slot)	(nothing)

Words 0-3 change every time because the address range changes. Words 4-13 never change—same accumulator init, same load/XOR/advance loop, same result store, same halt. This distinction matters later when we get to chunk scanning.
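A minimal sketch of that split, reusing the encodings from make_checksum_program() above. The name make_checksum_header() shows up later in this post, but its body here is an assumption based on words 0-3, and decode_header() is a hypothetical helper added only to sanity-check the encoding. The example address is made up.

```python
def make_checksum_header(start_kseg1, size):
    """Build words 0-3: the LUI/ORI pairs that load chunk start and end."""
    end = start_kseg1 + size
    return [
        0x3C080000 | (start_kseg1 >> 16),     # lui $t0, start_hi
        0x35080000 | (start_kseg1 & 0xFFFF),  # ori $t0, $t0, start_lo
        0x3C090000 | (end >> 16),             # lui $t1, end_hi
        0x35290000 | (end & 0xFFFF),          # ori $t1, $t1, end_lo
    ]


def decode_header(words):
    """Hypothetical inverse: recover (start, end) from the four header words."""
    start = ((words[0] & 0xFFFF) << 16) | (words[1] & 0xFFFF)
    end = ((words[2] & 0xFFFF) << 16) | (words[3] & 0xFFFF)
    return start, end


# Example: one 8 KB chunk at an assumed KSEG1 address.
header = make_checksum_header(0xA0800000, 0x2000)
print([f"0x{w:08x}" for w in header])
print([hex(v) for v in decode_header(header)])
```

Decoding the header back into a (start, end) pair is a cheap way to confirm on the host side that the four words really describe the chunk you meant to check.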

run_xor(): The Inner Execution Loop

The run_xor() function inside cpu_scan_and_fix() handles the actual trampoline execution. The pattern is identical to the flush trampoline from post 10—write, verify, resume, check PC—but with a twist: we read back a result.

First it writes a sentinel zero to the result slot at TRAMPOLINE_ADDR + 0x40, so a stale result left over from an earlier run can’t be mistaken for a fresh one. The zero by itself can’t distinguish “CPU wrote a zero checksum” from “CPU never ran”: a region that happens to XOR to zero is perfectly valid, and it looks exactly like an untouched sentinel. That ambiguity is settled by the PC check described below, which confirms the program actually ran through to its sdbbp.

Then it writes all 14 program words via mww and verifies every single one via mdw readback. Yes, 14 PRACC round-trips just to confirm the program is correct. This is a 56-byte program. At PRACC’s 0.04% error rate the odds of a bad write are tiny, but if even one instruction is wrong the CPU computes garbage and we blame the RAM for being corrupt when really we just XORed with the wrong register. Been there.
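The write-then-read-back step can be sketched roughly like this. It assumes ocd.cmd() returns OpenOCD's reply as a string and that an mdw reply looks like "0xa0800000: 3c08a080"; both are assumptions about the wrapper, and write_program_verified() and value_after_colon() are hypothetical names.

```python
def value_after_colon(reply):
    """Parse an 'ADDR: VALUE' style reply into an int (assumed reply shape)."""
    return int(reply.rsplit(":", 1)[1].split()[0], 16)


def write_program_verified(ocd, base_addr, words):
    """Write each word with mww, read it back with mdw, return the bad indices."""
    bad = []
    for i, word in enumerate(words):
        addr = base_addr + i * 4
        ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")
        readback = value_after_colon(ocd.cmd(f"mdw 0x{addr:08x}"))
        if readback != word:
            bad.append(i)  # either the write or the readback got flipped
    return bad
```

An empty list means every word read back as written. A non-empty list doesn't say whether the write or the read was the flipped one, so the safe response is to rewrite those words and check again.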

After verification, it resumes the CPU at TRAMPOLINE_ADDR:

ocd.cmd(f"resume {TRAMPOLINE_ADDR}", timeout=5.0)
ocd.cmd("wait_halt 2000", timeout=5.0)

Then checks that the PC landed at TRAMPOLINE_ADDR + 0x30—that’s instruction 12 (12 x 4 = 0x30 byte offset), the sdbbp. If the PC is anywhere else, something went wrong and the attempt is discarded. Maybe a program word got corrupted during write and the CPU jumped into the weeds, maybe the wait_halt timed out and the CPU is still spinning. Either way, the result is garbage and we throw it away.

Finally it reads the result from TRAMPOLINE_ADDR + 0x40 via a single mdw. That one word is the XOR of every 32-bit word in the checked region, computed by the CPU at 560 MHz with zero PRACC involvement. One read, not thousands.
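Put together, one execution attempt might look like the sketch below. It takes the trampoline address as an int rather than a hex string, reads the PC with OpenOCD's reg command, and assumes replies of the form "pc (/32): 0xa0800030" and "0xa0800040: 1234abcd". The reply parsing and the run_once() name are assumptions, not the script's actual code.

```python
SDBBP_INDEX = 12       # word 12 of the 14-word program is the sdbbp
RESULT_OFFSET = 0x40   # result slot offset, per the text above


def hex_after_colon(reply):
    """Parse the hex value after the last colon in a reply (assumed shape)."""
    return int(reply.rsplit(":", 1)[1].split()[0], 16)


def run_once(ocd, trampoline_addr):
    """Run the already-written program once.

    Returns the XOR result, or None if the CPU did not halt on the sdbbp.
    """
    result_addr = trampoline_addr + RESULT_OFFSET
    ocd.cmd(f"mww 0x{result_addr:08x} 0x00000000")        # clear the result slot
    ocd.cmd(f"resume 0x{trampoline_addr:08x}", timeout=5.0)
    ocd.cmd("wait_halt 2000", timeout=5.0)

    pc = hex_after_colon(ocd.cmd("reg pc"))
    if pc != trampoline_addr + SDBBP_INDEX * 4:
        return None                                       # discard this attempt
    return hex_after_colon(ocd.cmd(f"mdw 0x{result_addr:08x}"))
```

Returning None for a bad halt PC keeps "the CPU never finished" separate from any real checksum value, including zero.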

If the checksum doesn’t match, it retries up to 3 times. The retries exist because the 14-word program write or the single-word result read can still get flipped by PRACC. But 3 failures in a row on a 15-word exchange means something is actually fucked, and we should stop pretending otherwise.
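The retry policy is small enough to sketch on its own. xor_with_retries() is a hypothetical name, and run_attempt stands in for whatever performs one full write, resume, and readback cycle.

```python
def xor_with_retries(run_attempt, expected, max_attempts=3):
    """Call run_attempt() until it returns `expected`, up to max_attempts times.

    Returns (matched, attempts_used). run_attempt() may return None when the
    halt PC check failed; that simply counts as a failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        if run_attempt() == expected:
            return True, attempt
    return False, max_attempts
```

Capping the loop at three means a chunk that is genuinely wrong in RAM gets reported quickly instead of being retried forever.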

The XOR Cancellation Problem

Here’s the thing about XOR checksums that keeps me up at night: they can lie. If word A got corrupted from 0x11111111 to 0x22222222, that’s an XOR delta of 0x33333333. If word B also got corrupted and happens to introduce the exact same delta of 0x33333333, the two deltas cancel out and the final XOR matches the expected value. Congratulations, your checksum just told you 6.9 MB of data is perfect when two words are wrong.

Over a small region this is astronomically unlikely. Over 6.9 MB (roughly 1.7 million words) with PRACC’s error rate, it’s not impossible. Could I have used CRC32 or something stronger? Sure, but that means a bigger MIPS program, more instructions to get wrong, and more PRACC writes to set it up. XOR needs 3 instructions of real work in the inner loop: lw, xor, addiu (plus the bne and its delay slot). CRC32 would be at least 15 with the polynomial shifts and table lookups. On a CPU where every trampoline word is a potential PRACC failure, shorter programs are safer programs.

It’s why we don’t just run a single full-binary XOR and call it a day. In a full-binary XOR, two bad words with matching deltas cancel no matter where they sit in the image. In the chunk scan they only cancel if both land in the same 8 KB chunk (2048 words), which is far less probable. A cancellation inside a single chunk is the one case that still gets through, and the final full-binary pass won’t catch it either, because the full XOR is just the XOR of all the chunk XORs.
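A small host-side demonstration of the effect, using a made-up four-chunk image rather than real firmware. Two words in different chunks get the same delta: the full-image XOR can't see it, but each chunk's XOR can.

```python
from functools import reduce
from operator import xor

CHUNK_WORDS = 2048  # 8 KB chunks of 32-bit words


def xor_words(words):
    """XOR a sequence of 32-bit words together."""
    return reduce(xor, words, 0)


def chunk_xors(words):
    """Per-chunk XOR values for a list of words."""
    return [xor_words(words[i:i + CHUNK_WORDS])
            for i in range(0, len(words), CHUNK_WORDS)]


# Made-up test image: 4 chunks of arbitrary-looking words.
good = [(i * 0x9E3779B1) & 0xFFFFFFFF for i in range(4 * CHUNK_WORDS)]

bad = list(good)
delta = 0x33333333
bad[10] ^= delta                 # corrupt one word in chunk 0
bad[CHUNK_WORDS + 10] ^= delta   # same delta on a word in chunk 1

full_match = xor_words(good) == xor_words(bad)
failing = [i for i, (g, b) in enumerate(zip(chunk_xors(good), chunk_xors(bad)))
           if g != b]

print(full_match)  # True: the two deltas cancel in the full-image XOR
print(failing)     # [0, 1]: each chunk sees one uncancelled delta
```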

verify_and_fix(): The Old Way (Don’t Do This)

Before cpu_scan_and_fix() existed, I had verify_and_fix(). It’s still in the code, deprecated but not deleted, because I’m sentimental about my mistakes. The idea was straightforward: use OpenOCD’s dump_image to read back the entire binary, compare it against the file, fix any mismatches with mww, re-verify with mdw.

It had two failure modes that made it completely unreliable.

Phantom errors: the RAM is correct, but PRACC misreads a word during verification. The script sees a mismatch, “fixes” the already-correct word by overwriting it with the same value (or worse, the PRACC write also flips a bit and now the RAM is actually wrong). You’ve taken correct data and potentially corrupted it in the name of fixing it.

Missed errors: the RAM is corrupt, but PRACC introduces a second bit flip during the readback that happens to produce the expected value. The script sees a match and moves on. Corrupt word stays corrupt.

The fundamental problem is obvious in hindsight: if your verification uses the same error-prone channel as your data transfer, your verification inherits the same error rate. You’re checking PRACC’s work with PRACC. That’s like asking the kid who failed the test to grade his own paper.

I ran verify_and_fix() for about a week before I realized it was making things worse. The symptom was maddening: the script would report “fixed 12 words” and then the kernel would panic on boot anyway. Sometimes it would take 3 or 4 full load-verify cycles to get a clean boot. The fix rate should have been a clue. Twelve words “fixed” out of 1.7 million means some of those fixes were phantom corrections on words that were already fine. I just didn’t want to believe my verification was the problem because I’d spent a whole evening writing it.

cpu_scan_and_fix(): The Good Approach

This is the one that actually works. The core insight is simple—make the CPU do the verification. The CPU reads RAM over its internal bus at 560 MHz with zero bit errors. PRACC is only used for the thin control layer: writing the tiny checksum program and reading back one result word.

The function divides the 6.9 MB binary into 847 chunks of 8192 bytes each. For each chunk:

Update words 0-3 of the checksum program (the LUI/ORI address pairs)
CPU runs the XOR over that 8 KB chunk
Compare the CPU’s result against the Python-computed expected XOR for that chunk
If mismatch: rewrite the entire chunk from file bytes, re-verify with a second CPU XOR
Up to 3 rewrite attempts per bad chunk
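The Python-side expected values from the steps above could be computed along these lines. This sketch assumes the CPU reads words big-endian (pass "<" for a little-endian target) and ignores any trailing 1-3 bytes that don't fill a word; expected_xor() and expected_chunk_xors() are hypothetical names.

```python
import struct

CHUNK_SIZE = 8192  # bytes per chunk


def expected_xor(data, byteorder=">"):
    """XOR all whole 32-bit words in `data`.

    byteorder is a struct prefix: ">" big-endian (assumed here), "<" little.
    Trailing bytes that do not fill a word are ignored in this sketch.
    """
    usable = len(data) - (len(data) % 4)
    acc = 0
    for (word,) in struct.iter_unpack(f"{byteorder}I", data[:usable]):
        acc ^= word
    return acc


def expected_chunk_xors(file_data):
    """One expected XOR per 8 KB chunk, in file order."""
    return [expected_xor(file_data[off:off + CHUNK_SIZE])
            for off in range(0, len(file_data), CHUNK_SIZE)]
```

Computing all the expected values up front means the scan loop only does integer comparisons between PRACC operations.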

Here’s the key optimization: words 4-13 of the program never change. So cpu_scan_and_fix() writes them once at the start and only updates words 0-3 for each new chunk. That’s 4 mww writes plus 1 mdw read per chunk—5 PRACC operations total. Compare that to verify_and_fix() which needed 2048 mdw reads per 8 KB chunk. We went from 2048 PRACC round-trips to 5. That’s not an optimization, that’s a different universe.

# Write constant program body once (words 4-13)
for i, word in enumerate(BODY_WORDS):
    addr = int(TRAMPOLINE_ADDR, 16) + (4 + i) * 4
    ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")

# Per-chunk: only update the 4 address words
for chunk_idx in range(num_chunks):
    chunk_start = load_addr + chunk_idx * CHUNK_SIZE
    header = make_checksum_header(chunk_start, CHUNK_SIZE)
    for i, word in enumerate(header):  # just 4 words
        addr = int(TRAMPOLINE_ADDR, 16) + i * 4
        ocd.cmd(f"mww 0x{addr:08x} 0x{word:08x}")
    # ... resume, wait_halt, read result, compare ...

Typically only about 5 out of 847 chunks come back bad. Five. Out of 847. That’s a 0.6% chunk failure rate; if each bad chunk holds a single bad word, that works out to roughly one corrupted word per 350,000 loaded. The bad chunks get completely rewritten from file bytes: no guessing which specific words are wrong, just nuke the whole 8 KB and rewrite it. Then a second CPU XOR confirms the rewrite took. If it still fails after 3 attempts, something is seriously wrong with that memory region and the function bails.
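The control flow of the scan can be sketched independently of OpenOCD by passing the hardware steps in as callables. Here cpu_xor(idx) stands in for "run the trampoline on chunk idx and return the result" and rewrite_chunk(idx) for "rewrite that chunk from file bytes". Both names, and the choice to raise an exception when the function bails, are assumptions.

```python
def scan_and_fix(expected_xors, cpu_xor, rewrite_chunk, max_rewrites=3):
    """Verify every chunk, rewriting any that mismatch.

    Returns the indices of chunks that needed a rewrite. Raises RuntimeError
    if a chunk still fails after max_rewrites rewrite attempts.
    """
    rewritten = []
    for idx, expected in enumerate(expected_xors):
        if cpu_xor(idx) == expected:
            continue                          # chunk is clean, move on
        for _ in range(max_rewrites):
            rewrite_chunk(idx)                # rewrite the whole 8 KB chunk
            if cpu_xor(idx) == expected:      # second CPU XOR confirms it
                rewritten.append(idx)
                break
        else:
            raise RuntimeError(
                f"chunk {idx} still bad after {max_rewrites} rewrites")
    return rewritten
```

Keeping the hardware access behind two callables also makes the loop easy to exercise on a desk with fake memory before pointing it at the real board.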

The Final Full-Binary XOR

After the chunk scan completes with all 847 chunks verified, there’s one more pass: a full-binary XOR across the entire 6.9 MB. Same make_checksum_program(), same run_xor(), but with start = LOAD_ADDR and size = len(file_data).

This is the belt-and-suspenders check. The chunk scan should have caught and fixed everything, but the full-binary pass guards against the edge case where a chunk rewrite introduced a new error that wasn’t caught by the per-chunk re-verify (remember, PRACC can flip bits during the rewrite too). It does not add protection against the XOR cancellation scenario I mentioned earlier, though: the full-binary XOR is just the XOR of all the chunk XORs, so errors that cancel within a chunk cancel here too. That protection comes from the chunk scan, where two bad words can only hide each other if they land in the same 8 KB chunk.

Both the chunk scan and the full-binary XOR must agree before we proceed to launch. If either fails, load_and_run() returns False and the main loop power-cycles for another attempt. In practice, I’ve never seen the full-binary check fail after a clean chunk scan. But “never seen it fail” is not the same as “can’t fail,” and I already learned that lesson the hard way with verify_and_fix().

Why This Works

The elegance of cpu_scan_and_fix() comes down to one principle: use the right tool for each job. The CPU is perfect at reading RAM—560 MHz, zero errors, direct bus access. PRACC is fine for small exchanges—writing a 14-word program or reading a single result. The old approach used PRACC for everything and inherited its error rate across millions of operations. The new approach uses PRACC for 5 operations per chunk and lets the CPU handle the other 2048. You’re not fighting the hardware anymore. You’re working with it.

Five bad chunks out of 847 is a typical run. Rewrite those five, re-verify, full-binary confirm, launch. The whole verification phase takes about 90 seconds—most of that is the 847 individual trampoline executions, each taking roughly 100ms of PRACC overhead. It’s not fast. But it’s correct, and on a transport layer that flips bits 0.04% of the time, correct is the only thing that matters.
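The numbers in this section are easy to check with a few lines of arithmetic. The constants below come straight from the text; the 100 ms per chunk is the rough figure quoted above, not a measurement.

```python
NUM_CHUNKS = 847
WORDS_PER_CHUNK = 8192 // 4    # 2048 words in an 8 KB chunk
OPS_PER_CHUNK_NEW = 5          # 4 mww writes + 1 mdw read
SECONDS_PER_CHUNK = 0.100      # rough PRACC overhead per trampoline run

old_ops = NUM_CHUNKS * WORDS_PER_CHUNK    # one PRACC read per word
new_ops = NUM_CHUNKS * OPS_PER_CHUNK_NEW

print(old_ops)                                   # 1734656
print(new_ops)                                   # 4235
print(round(old_ops / new_ops, 1))               # 409.6
print(round(NUM_CHUNKS * SECONDS_PER_CHUNK, 1))  # 84.7
```

So the chunk scan does roughly 400 times fewer PRACC operations than a word-by-word readback, and 847 runs at about 100 ms each accounts for most of the 90 seconds.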

Once cpu_scan_and_fix() was in place, a typical run looked like this: 847 chunks verified, 5 rewrites, full-binary pass, kernel booted clean. The whole verify_and_fix() approach was just using the wrong tool for the job.