The cache coherency nightmare and other fun MIPS things

I been goin’ through some things, I was pourin’ and the Henny

So last post ended with the binary loaded, the CPU launched, and data error! on the serial console. The lzma-loader—the little decompressor that’s supposed to unpack our kernel—was choking on the LZMA stream. The binary I loaded was correct (I verified the MD5 before loading), the ath79 build was the right one, the address was correct. What the fuck.

PRACC Bit Errors: Your JTAG Cable Is a Liar

The first thing I discovered was that load_image is not a reliable operation. Like, at all.

I compared a 45 KB sample of what was in RAM against the original file. 18 single-bit flips. Scattered randomly across the region. That’s roughly a 0.04% error rate per word, which across a 6.9 MB binary means about 60 corrupted words per load.

Now the obvious thought is “bad wires, signal integrity, you’re using a shitty ESP-Prog.” And yeah the ESP-Prog is shitty, but that’s not the issue. The bit-flip rate doesn’t change when you reduce the JTAG clock speed—I tried 1000 kHz, 500 kHz, 100 kHz, same error rate. If it were signal integrity, slower clocks would help. These are PRACC protocol errors—timing violations in the MIPS EJTAG Processor Access handshake state machine. The debug interface just occasionally drops a bit during the handshake, and there’s nothing you can do about it at the transport level.

So okay, the transport is unreliable. You know what my degree taught me? Verify your shit and build error correction. So I did what anyone would do—read back the binary from RAM and compare it against the file, then rewrite any bad words.

The Phantom Verify Problem

My first verify approach used OpenOCD’s mdw command to read back every word from RAM via PRACC. Read the word, compare it against the file, if it doesn’t match then rewrite it. Simple and elegant.

Except PRACC reads have the same 0.04% bit-flip rate as PRACC writes.

The verify step was as unreliable as the load step. I was getting two beautiful failure modes:

  1. Phantom errors: A word in RAM is perfectly fine, but PRACC reads it back wrong. My script “fixes” it by rewriting the correct value that was already there. Thanks for nothing.
  2. Missed errors: A word in RAM is genuinely corrupt, but PRACC reads it back with a SECOND bit flip that happens to produce the expected value. The verify step sees a match and moves on. The corruption stays.

I ran the verify, it “fixed” 15 words. The full-binary XOR checksum was the same before and after the “fixes.” The lzma-loader still said data error!. Those 15 fixes were all phantoms, and the real corruptions were hiding behind read errors.

This one took me a while to figure out. The lesson is obvious in hindsight: if your verification uses the same error-prone channel as your data transfer, your verification inherits the same error rate. You need a fundamentally different path for verification.

CPU-Executed Checksums: The Actually Good Idea

The solution was to make the CPU itself do the verification. The CPU reads RAM through its L1 cache at 560 MHz with zero bit errors—only JTAG/PRACC operations are unreliable. So instead of reading back millions of words over PRACC, I wrote a tiny MIPS program that runs ON the MR18’s CPU.

The program XORs every 32-bit word in the loaded binary and stores the result at a known address. Then OpenOCD reads back just that single 4-byte result (one PRACC read instead of millions). I precompute the expected XOR value in Python and compare.

For finer granularity, the script divides the binary into 847 chunks of 8 KB each. For each chunk it only needs 5 PRACC operations—write the chunk’s start/end addresses, resume the CPU, wait for halt, read the result. At 5 ops per chunk instead of 2000+ reads, the probability of a PRACC error affecting the result is basically zero.
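The host side of this is simple enough to sketch. This is my illustrative version, not the actual script (names are mine); it assumes the binary is big-endian, as the AR9344 is, and padded to a 4-byte boundary:

```python
import struct

CHUNK = 8 * 1024  # 8 KB per chunk, matching the per-chunk CPU verify

def xor32(data: bytes) -> int:
    """XOR together every 32-bit big-endian word in `data`."""
    acc = 0
    for (word,) in struct.iter_unpack(">I", data):
        acc ^= word
    return acc

def expected_checksums(binary: bytes) -> list[int]:
    """Precomputed expected XOR value for each 8 KB chunk."""
    return [xor32(binary[off:off + CHUNK])
            for off in range(0, len(binary), CHUNK)]
```

Any chunk whose CPU-computed XOR disagrees with the corresponding precomputed value is flagged for rewrite.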

Any chunk with a mismatched XOR gets completely rewritten from the file and re-verified. Typically only about 5 out of 847 chunks are bad. The whole scan-and-fix pass takes about 60 seconds.

Hand-Encoding MIPS Assembly: Why

Now here’s where it gets kinda nuts. Those “tiny MIPS programs” I mentioned? I didn’t use an assembler. I hand-encoded every single instruction as 32-bit hexadecimal constants directly in Python.

Why? Honestly because I didn’t have a MIPS assembler set up and I’m stubborn. Also because there’s something deeply satisfying about encoding instructions at the bit level when you understand the architecture. My embedded systems and computer architecture classes actually came in clutch here—who would have thought.

MIPS32 instructions are all exactly 32 bits and come in three formats: R-type (register operations), I-type (immediate values), and J-type (jumps). The encoding is mechanical once you know the format:

# This is how you encode "XOR $t2, $t2, $t3" by hand
# R-type: op=0, rs=10($t2), rt=11($t3), rd=10($t2), sa=0, funct=0x26(XOR)
# (0 << 26) | (10 << 21) | (11 << 16) | (10 << 11) | (0 << 6) | 0x26
# = 0x014B5026
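In Python the mechanical part becomes a one-liner per format. A sketch of an R-type encoder (the function and constant names are mine, not the actual script's):

```python
def r_type(rs: int, rt: int, rd: int, sa: int, funct: int) -> int:
    """Encode a MIPS32 SPECIAL (R-type) instruction; the opcode field is 0."""
    return (0 << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct

T2, T3 = 10, 11  # $t2 and $t3 are architectural registers 10 and 11

# xor $t2, $t2, $t3
XOR_T2_T2_T3 = r_type(rs=T2, rt=T3, rd=T2, sa=0, funct=0x26)
assert XOR_T2_T2_T3 == 0x014B5026  # matches the hand calculation above
```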

I wrote three “trampolines”—tiny programs that get written to a scratch area in RAM and executed:

  1. D-Cache Flush Trampoline (8 words): Reads 128 KB of sequential KSEG0 addresses to evict all dirty cache lines via LRU replacement. More on why this exists in a second.
  2. XOR Checksum Program (14 words): The chunk verification program described above.
  3. Launch Trampoline (2 words + NOPs): a J 0xA0060000 that jumps to the lzma-loader entry point, with a NOP behind it to fill the branch delay slot.
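To give a flavor of what one of these looks like fully encoded, here's a plausible reconstruction of the D-cache flush loop. This is my illustrative re-encoding for the post, not the exact words from my script; it assumes 32-byte cache lines and a 128 KB sweep starting at the bottom of KSEG0:

```python
# Each entry is one hand-encoded MIPS32 instruction (big-endian word).
FLUSH_TRAMPOLINE = [
    0x3C088000,  # lui   $t0, 0x8000    ; cursor = 0x80000000 (KSEG0)
    0x3C098002,  # lui   $t1, 0x8002    ; end    = 0x80020000 (+128 KB)
    0x8D0A0000,  # lw    $t2, 0($t0)    ; loop: touch a line -> fill/evict
    0x25080020,  # addiu $t0, $t0, 32   ; advance one 32-byte cache line
    0x1509FFFD,  # bne   $t0, $t1, loop ; opcode 0x05 -- NOT 0x04/beq!
    0x00000000,  # nop                  ; branch delay slot
    0x7000003F,  # sdbbp                ; debug breakpoint: halt for OpenOCD
]

# Sanity checks on the encodings:
assert FLUSH_TRAMPOLINE[4] >> 26 == 0x05  # bne, not beq
assert 0x20000 // 32 == 4096              # lines swept = the full cache
```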

I also wrote a verification script (verify_asm.py) that feeds my hand-encoded hex to the Capstone disassembler and confirms it disassembles to what I intended. Good thing I did too, because…

BEQ vs BNE: One Bit, One Very Bad Day

The D-cache flush trampoline has a loop. The loop should branch back to the top while the pointer hasn’t reached the end address. That’s a BNE (Branch if Not Equal) instruction—opcode 0x05.

I typed 0x04. That’s BEQ—Branch if Equal.

One bit difference. The loop branched when $t0 == $t1 (the termination condition) instead of when $t0 != $t1 (the continuation condition). The loop body executed exactly once, flushing a single cache line instead of the full 4096 lines needed to cover the entire cache. The flush “succeeded” (SDBBP was reached, OpenOCD saw the halt) but it did basically nothing.

Capstone caught it. verify_asm.py showed beq where I expected bne. I stared at it for way too long before I realized the opcode field was wrong by a single bit. Changed 0x04 to 0x05, reran, flush loop now actually loops. Beautiful.
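For the record, the two encodings really do differ by exactly one bit in the opcode field. A quick sketch (the `i_type` helper and the branch offset are illustrative, not lifted from my script):

```python
def i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Encode a MIPS32 I-type instruction (16-bit immediate, two's complement)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

T0, T1 = 8, 9
bne = i_type(0x05, T0, T1, -3)  # bne $t0, $t1, loop  (what I meant)
beq = i_type(0x04, T0, T1, -3)  # beq $t0, $t1, loop  (what I typed)

assert bne ^ beq == 1 << 26  # a single bit, in the opcode field
```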

The D-Cache Stale Data Problem

Okay this is the big one. This is the bug that drove me insane and is also genuinely the most interesting computer architecture problem I’ve ever debugged in practice.

Here’s the setup: the MIPS32 architecture has two kernel segments that map to the same physical memory—KSEG0 (cached, virtual addresses 0x80000000-0x9FFFFFFF) and KSEG1 (uncached, 0xA0000000-0xBFFFFFFF). Both segments map to physical 0x00000000-0x1FFFFFFF by just masking off the top bits. The ONLY difference is that KSEG0 goes through the CPU’s data cache and KSEG1 does not.

This means 0x80060000 and 0xA0060000 both point to physical address 0x00060000. But reading from 0x80060000 might give you whatever’s sitting in the D-cache (which could be stale), while reading from 0xA0060000 always gives you what’s actually in physical RAM.
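The aliasing is pure bit masking, which you can sanity-check in a few lines (a sketch, with names of my choosing):

```python
KSEG_PHYS_MASK = 0x1FFFFFFF  # KSEG0/KSEG1 cover 512 MB of physical space

def to_phys(vaddr: int) -> int:
    """KSEG0/KSEG1 virtual address -> physical: drop the top three bits."""
    return vaddr & KSEG_PHYS_MASK

def kseg0(paddr: int) -> int:  # cached alias
    return 0x80000000 | paddr

def kseg1(paddr: int) -> int:  # uncached alias
    return 0xA0000000 | paddr

assert to_phys(0x80060000) == to_phys(0xA0060000) == 0x00060000
assert kseg0(0x00060000) == 0x80060000 and kseg1(0x00060000) == 0xA0060000
```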

Here’s what was happening:

Step 1: The MR18 powers on. The Nandloader reads the Cisco kernel from NAND and writes it to 0x8005FC00 (KSEG0, cached). These writes go into the D-cache as dirty lines. The AR9344 uses write-back caching, meaning the dirty data just sits in cache and gets written to physical RAM whenever the cache feels like it (or when the line gets evicted).

Step 2: I halt the CPU and use load_image to write our OpenWrt binary to 0xA005FC00 (KSEG1, uncached). These writes go straight to physical RAM, completely bypassing the cache. Physical RAM now contains our correct OpenWrt binary. But the D-cache still has dirty Cisco data for those same physical addresses. The cache has no idea we just wrote different data to the physical memory underneath it.

Step 3: My verification (XOR checksums) runs via KSEG1—uncached reads straight from physical RAM. Everything checks out. All 847 chunks pass. Full XOR matches. I think we’re golden.

Step 4: I launch the lzma-loader. It runs via KSEG0—cached. When it reads the compressed kernel data, the D-cache serves the stale Cisco data from its dirty lines instead of our correct OpenWrt data in physical RAM. The LZMA stream is corrupted. data error!

All my verification was correct because I was reading through KSEG1 (uncached, sees real RAM). All my execution was broken because the lzma-loader reads through KSEG0 (cached, sees stale Cisco garbage in the D-cache).

Making It Worse Before Making It Better

When I figured out that cache coherency was the problem, my first thought was “okay, flush the D-cache after loading the binary.” I wrote the D-cache flush trampoline (the one with the BEQ/BNE bug), fixed the BEQ bug, ran it AFTER load_image. Still data error!.

It took me an embarrassingly long time to realize that flushing AFTER the load is worse than not flushing at all. Here’s why:

  1. load_image writes OpenWrt to physical RAM via KSEG1. RAM is correct. Cache still has dirty Cisco lines.
  2. Flush trampoline runs via KSEG0. The flush loop reads 128 KB of addresses. These reads need cache lines.
  3. The cache sets are full of dirty Cisco lines. LRU eviction fires—the dirty Cisco lines get written back to physical RAM.
  4. The Cisco write-back OVERWRITES our freshly loaded OpenWrt binary.

The flush I designed to fix the problem was the thing destroying my binary. The stale Cisco data in the cache gets evicted and written back to RAM, overwriting the exact data I just loaded.

The fix is to flush BEFORE load_image. Before the load, flushing writes back the Cisco data to RAM—who cares, we’re about to overwrite that RAM anyway. After the flush, the cache is clean. Then load_image writes our binary to physical RAM via KSEG1. The cache has no dirty lines for those addresses. When the lzma-loader reads via KSEG0, every access is a cache miss that fetches the correct data from physical RAM.
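If that ordering argument feels slippery, here's a toy model you can actually run: a single direct-mapped write-back cache line over a dict standing in for RAM. This is nothing like the real 74Kc cache, just the minimum machinery to reproduce the clobber:

```python
class ToyWriteBackLine:
    """One write-back cache line; a miss evicts, writing back dirty data."""
    def __init__(self, ram):
        self.ram = ram
        self.tag, self.data, self.dirty = None, None, False

    def _fill(self, addr):
        if self.tag != addr:                     # miss: evict current line
            if self.dirty:
                self.ram[self.tag] = self.data   # write-back of dirty data
            self.tag, self.data, self.dirty = addr, self.ram.get(addr), False

    def cached_write(self, addr, val):   # KSEG0-style write
        self._fill(addr)
        self.data, self.dirty = val, True

    def cached_read(self, addr):         # KSEG0-style read
        self._fill(addr)
        return self.data

    def uncached_write(self, addr, val): # KSEG1-style write: bypasses cache
        self.ram[addr] = val

# Flush AFTER load (the broken order):
ram, line = {}, None
line = ToyWriteBackLine(ram)
line.cached_write(0x60000, "cisco")      # Nandloader wrote via KSEG0
line.uncached_write(0x60000, "openwrt")  # load_image wrote via KSEG1
line.cached_read(0x99999)                # "flush" sweep: eviction fires
assert ram[0x60000] == "cisco"           # write-back clobbered OpenWrt

# Flush BEFORE load (the fix):
ram = {}
line = ToyWriteBackLine(ram)
line.cached_write(0x60000, "cisco")
line.cached_read(0x99999)                # flush first: harmless write-back
line.uncached_write(0x60000, "openwrt")  # now load via KSEG1
assert line.cached_read(0x60000) == "openwrt"  # miss fetches real RAM
```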

Before vs after. Same operation, opposite outcomes. This is why I love and hate computer architecture.

The Full Pipeline

After all these bugs (and this is only some of them—I hit 13 distinct bugs before the binary even booted successfully), the final load pipeline looks like this:

  1. Phase 0: Pre-load D-cache flush (evict Cisco’s dirty lines)
  2. Phase 1: load_image (6.9 MB at ~97 KB/s, ~70 seconds)
  3. Phase 2: Post-load D-cache flush (belt and suspenders)
  4. Phase 3a: Full-binary XOR checksum
  5. Phase 3b: Per-chunk CPU-XOR scan, rewrite any bad chunks
  6. Phase 3c: Final full-binary XOR checksum
  7. Phase 4: Launch trampoline—jump to lzma-loader

Total time: about 2-2.5 minutes of JTAG operations. Then the CPU leaves our control, the lzma-loader decompresses the kernel, and Linux boots.

And finally, after all of that…

[    0.000000] Linux version 6.6.73 (builder@buildhost) ...
[    0.000000] CPU: MIPS 74Kc V5.0

OpenWrt was booting on the MR18. But the story doesn’t end here because—surprise—nothing else worked. The next post covers the absolute circus of trying to trigger failsafe mode, which involves 5 failed approaches, a reset supervisor IC that refuses to die, and the discovery that sometimes the simplest solution (sending the letter ‘f’ over serial) is the one that works.