7 - MR18 Deep Dive: Hand-Encoded MIPS Assembly

Encoding instructions by hand like it’s 1985

They said “use a tool,” I said “nah I got this”

Post 4 mentioned that I hand-encoded every MIPS instruction as hex constants directly in Python. I figured it’s worth breaking down what that actually looks like in practice. So here’s the deep dive—every trampoline, every encoding decision, and the one-bit typo that almost ruined everything.

Why Hand-Encoded

I didn’t have a MIPS cross-assembler installed. I could have installed one. I chose not to because I’m stubborn and because encoding MIPS instructions by hand is genuinely educational in a way that using a toolchain isn’t. My computer architecture class taught me the formats on paper. Actually producing working machine code for a real CPU on real hardware is a completely different experience. Also I only needed like 34 instructions total. It’s not like I was writing an operating system.

MIPS32 Instruction Formats

Every MIPS32 instruction is exactly 32 bits. Three formats:

R-type: [ op 6 | rs 5 | rt 5 | rd 5 | sa 5 | funct 6 ]
I-type: [ op 6 | rs 5 | rt 5 | imm 16                ]
J-type: [ op 6 | target 26                            ]

R-type is for register-register ops (XOR, ADD, etc.)—op is zero and funct tells the ALU what to do. I-type handles immediates—ADDIU, LW, BNE, LUI, ORI. J-type is just J and JAL. The encoding is mechanical—look up the opcode, slot register numbers into bit positions, shift, OR, done. For example, XOR $t2, $t2, $t3:

# R-type: op=0, rs=$t2(10), rt=$t3(11), rd=$t2(10), sa=0, funct=0x26
# (0 << 26) | (10 << 21) | (11 << 16) | (10 << 11) | (0 << 6) | 0x26
# = 0x014B5026

Once you do a few of these it becomes second nature. The register encoding trips you up—$t0 is register 8, not 0. $t2 is 10. MIPS register numbering is a rite of passage.

The Trampolines

Before I get into each one—there are two different cache flush trampolines and it confused me at first too, so let me explain why both exist.

D_CACHE_FLUSH_TRAMPOLINE is the brute-force approach. It reads 128 KB of sequential KSEG0 addresses, which forces the CPU’s LRU replacement policy to evict every dirty cache line. It doesn’t use any special instructions—just lw in a loop. Think of it like filling a bucket until the old water overflows out. This one runs BEFORE load_image (Phase 0) to get rid of the dirty Cisco data.

FLUSH_TRAMPOLINE is the surgical approach. It uses the actual cache instruction to explicitly write back and invalidate every cache line by index. Instead of flooding the cache with reads, it walks every cache set and says “write your dirty data to RAM and then forget about it.” This one runs AFTER load_image (Phase 2) as a belt-and-suspenders cleanup.

Why not just use one? Honestly? Either one alone would have worked fine. The AR9344’s cache instruction has no errata that I ever found, and the brute-force LRU version does the job just as well. The real reason both exist is that I wrote the brute-force one first, then wrote the cache-instruction one later thinking it was “better,” and kept both because after spending hours staring at data error! I was too paranoid to remove anything that touched the cache.

D-Cache Flush Trampoline (8 words)

The brute-force one. Cisco’s dirty cache lines need to go before we load our binary. The strategy: read 128 KB of sequential KSEG0 addresses. The D-cache is 32 KB, 4-way set associative, 256 sets, 32-byte lines. Reading 4x the cache size guarantees every set gets 4 new lines via LRU, evicting all dirty originals.

In C, this is what the CPU is doing:

// evict every dirty D-cache line by reading 128KB of KSEG0 addresses
volatile int *ptr = (volatile int *)0x80000000;  // KSEG0 start
volatile int *end = (volatile int *)0x80020000;  // KSEG0 + 128KB
while (ptr < end) {
    (void)*ptr;       // read forces a cache line fill, evicting the old dirty line
    ptr += 8;         // advance 32 bytes (8 words = one cache line)
}
__debugbreak();       // return control to the JTAG debugger

And here’s the actual hand-encoded MIPS:

D_CACHE_FLUSH_TRAMPOLINE = [
    0x3C088000,  # lui  $t0, 0x8000        (set t0 to 0x80000000, the start of KSEG0 cached memory)
    0x3C098002,  # lui  $t1, 0x8002        (set t1 to 0x80020000, 128KB past start = loop end address)
    0x8D0A0000,  # lw   $t2, 0($t0)        (read a word from the address in t0, forcing a cache line fill)
    0x25080020,  # addiu $t0, $t0, 32      (move t0 forward by 32 bytes = one cache line width)
    0x1509FFFD,  # bne  $t0, $t1, -3       (if t0 != t1, jump back 3 instructions to the lw)
    0x00000000,  # nop                      (delay slot: always executes after the branch, does nothing)
    0x7000003F,  # sdbbp                    (trigger a debug breakpoint so OpenOCD regains control)
    0x00000000,  # nop
]

MIPS ASM Cheatsheet (click to expand)

Instruction	What it does	C equivalent
`lui rd, imm`	Load upper immediate—puts `imm` into the top 16 bits of `rd`, zeros the bottom 16	`rd = imm << 16`
`ori rd, rs, imm`	OR immediate—ORs the bottom 16 bits of `rs` with `imm`, zero-extended (NOT sign-extended)	`rd = rs \| imm`
`addiu rd, rs, imm`	Add immediate unsigned—adds `imm` to `rs`. WARNING: `imm` IS sign-extended, so `0x8000` becomes `-32768`	`rd = rs + (short)imm`
`lw rd, offset(rs)`	Load word—reads 4 bytes from memory at address `rs + offset` into `rd`	`rd = (int )(rs + offset)`
`sw rs, offset(rd)`	Store word—writes the 4 bytes in `rs` to memory at address `rd + offset`	`(int )(rd + offset) = rs`
`xor rd, rs, rt`	XOR—bitwise exclusive-or of two registers	`rd = rs ^ rt`
`bne rs, rt, offset`	Branch if not equal—if `rs != rt`, jump `offset` instructions forward/backward from the NEXT instruction	`if (rs != rt) goto target`
`j target`	Jump—unconditional jump to a 26-bit pseudo-direct address. Top 4 bits come from current PC	`goto target`
`cache op, offset(rs)`	Cache operation—`op=0x00` invalidates an I-cache line, `op=0x01` writes back + invalidates a D-cache line	no C equivalent (hardware cache control)
`sdbbp`	Software debug breakpoint—traps into EJTAG debug mode. This is how the CPU tells OpenOCD “I’m done”	`__debugbreak()`
`nop`	No operation—does nothing. Required after branches/jumps because MIPS has a branch delay slot (the CPU always executes the instruction right after a branch, even if the branch is taken)	(nothing)

Registers used: $t0-$t4 are temporary registers (numbers 8-12). $zero is register 0, always reads as 0. The $t registers are caller-saved—the kernel can trash them during a syscall.

Branch delay slot: This trips everyone up. In MIPS, after every branch or jump, the NEXT instruction ALWAYS executes regardless of whether the branch was taken. That’s why you see nop after every bne and j—it fills the delay slot with a harmless no-op.

Eight words. Load from 0x80000000, step by 32 bytes (one cache line), loop until 0x80020000, halt. The SDBBP instruction (Special Debug Breakpoint) is an EJTAG thing—it traps back into debug mode so OpenOCD regains control.

FLUSH_TRAMPOLINE: I/D-Cache Flush (10 words)

This is the surgical one. Instead of brute-forcing the cache with reads, it uses actual cache instructions to explicitly write back and invalidate every line by index. Faster, more precise, and hits both I-cache and D-cache.

In C (with made-up intrinsics for the cache instructions):

// walk every cache index and explicitly flush both I-cache and D-cache
unsigned int *addr = (unsigned int *)0x80000000;  // KSEG0 start
unsigned int *end  = (unsigned int *)0x80008000;  // KSEG0 + 32KB (full cache size)
while (addr < end) {
    __invalidate_icache_line(addr);    // I-cache: throw away the cached instruction at this index
    __writeback_invalidate_dcache_line(addr);  // D-cache: write dirty data to RAM, then throw away the line
    addr += 8;                         // advance 32 bytes (one cache line)
}
__debugbreak();

The actual MIPS:

FLUSH_TRAMPOLINE = [
    0x3C088000,  # lui  $t0, 0x8000        (set t0 to 0x80000000, start of KSEG0)
    0x3C098000,  # lui  $t1, 0x8000        (set t1 to 0x80000000 temporarily)
    0x35298000,  # ori  $t1, $t1, 0x8000   (t1 = 0x80000000 | 0x8000 = 0x80008000, end after 32KB)
    0xBD000000,  # cache 0x00, 0($t0)      (invalidate the I-cache line at the index in t0)
    0xBD010000,  # cache 0x01, 0($t0)      (write back + invalidate the D-cache line at t0's index)
    0x25080020,  # addiu $t0, $t0, 32      (move t0 forward by 32 bytes = one cache line)
    0x1509FFFC,  # bne  $t0, $t1, -4       (if t0 != t1, jump back 4 instructions to the cache ops)
    0x00000000,  # nop                      (delay slot: runs after every branch, does nothing)
    0x7000003F,  # sdbbp                    (trigger a debug breakpoint so OpenOCD regains control)
    0x00000000,  # nop
]

cache 0x00 is I-cache Index_Invalidate. cache 0x01 is D-cache Index_Writeback_Invalidate. Both caches are 32 KB (4 ways x 256 sets x 32 bytes), so the loop covers 0x0000 through 0x7FFF.

Now here’s a subtle thing. See that ori $t1, $t1, 0x8000? My first version used addiu $t1, $zero, 0x8000. That’s wrong. ADDIU sign-extends its 16-bit immediate, so 0x8000 becomes 0xFFFF8000—negative 32768, not positive. The loop would never terminate because $t0 (incrementing from 0) would never equal 0xFFFF8000. ORI zero-extends, giving you exactly 0x00008000. Sign extension is a bitch.

Launch Trampoline (2 words + NOPs)

The simplest one. After loading and verifying the binary, we need to jump to the lzma-loader entry point at 0xA0060000 (KSEG1 uncached).

In C:

// jump to the lzma-loader and never come back
goto *0xA0060000;

Yeah, that’s it. One jump. The MIPS:

LAUNCH_TRAMPOLINE = [
    0x08018000,  # j 0xA0060000            (jump to the lzma-loader entry point in uncached KSEG1)
    0x00000000,  # nop                      (delay slot: CPU always executes the instruction after a jump)
    0x00000000,  # nop                      (padding)
    0x00000000,  # nop                      (padding)
]

J-type encoding: the 26-bit target is the address right-shifted by 2 (instructions are word-aligned). 0xA0060000 >> 2 = 0x28018000. Top 4 bits come from PC, so target field is 0x28018000 & 0x03FFFFFF = 0x0018000. With opcode 0x02: (0x02 << 26) | 0x0018000 = 0x0A018000.

XOR Checksum Program (14 words)

This one is generated dynamically by make_checksum_program(start, end) because the start and end addresses change for each 8 KB chunk being verified. In C:

// XOR every 32-bit word in a memory range and store the result
uint32_t *ptr    = (uint32_t *)start;
uint32_t *end    = (uint32_t *)(start + size);
uint32_t  result = 0;
while (ptr < end) {
    result ^= *ptr;   // XOR each word into the accumulator
    ptr++;
}
*(volatile uint32_t *)RESULT_ADDR = result;  // store where OpenOCD can read it
__debugbreak();

The Python that generates the MIPS:

def make_checksum_program(start, end):
    return [
        0x3C080000 | (start >> 16),         # lui  $t0, start_hi   (load upper 16 bits of start address into t0)
        0x35080000 | (start & 0xFFFF),      # ori  $t0, $t0, start_lo (fill in lower 16 bits, t0 = full start addr)
        0x3C090000 | (end >> 16),            # lui  $t1, end_hi     (load upper 16 bits of end address into t1)
        0x35290000 | (end & 0xFFFF),         # ori  $t1, $t1, end_lo (fill in lower 16 bits, t1 = full end addr)
        0x240A0000,                          # addiu $t2, $zero, 0  (set accumulator t2 to zero)
        0x8D0B0000,                          # lw   $t3, 0($t0)    (read one 32-bit word from memory at t0)
        0x014B5026,                          # xor  $t2, $t2, $t3  (XOR the word into the accumulator)
        0x25080004,                          # addiu $t0, $t0, 4   (advance pointer by 4 bytes = one word)
        0x1509FFFC,                          # bne  $t0, $t1, -4   (if t0 != t1, jump back to the lw instruction)
        0x00000000,                          # nop                  (delay slot)
        0x3C0C0000 | (base >> 16),           # lui  $t4, base_hi   (load upper bits of result storage address)
        (0xAD8A0000) | RESULT_OFFSET,        # sw   $t2, offset($t4) (write the XOR result to a known memory spot)
        0x7000003F,                          # sdbbp               (halt the CPU so OpenOCD can read the result)
        0x00000000,                          # nop
    ]

LUI/ORI pairs build the start and end addresses. The loop XORs every word into $t2, stores the result at 0x8006FFF0, and halts. OpenOCD reads that single word over PRACC and compares against the Python-computed expected XOR. One PRACC read instead of 2000+.

The BEQ/BNE Bug

The D-cache flush trampoline needs BNE—branch back while $t0 != $t1. Opcode 0x05. I typed 0x04. That’s BEQ—Branch if Equal.

One. Fucking. Bit. The loop branched when $t0 == $t1 (termination) instead of $t0 != $t1 (continuation). It executed once—loaded one cache line, incremented, compared, not equal so BEQ falls through, hits SDBBP, halts. One line flushed out of 4096. The flush “succeeded” from OpenOCD’s perspective—CPU halted normally, no exceptions. But it flushed 0.02% of the cache. The other 99.98% of dirty Cisco lines were still sitting there.

I spent a solid hour staring at data error! before running verify_asm.py. Capstone showed beq t0, t1, ... where I expected bne. Changed 0x1508FFFD to 0x1509FFFD and everything worked.

verify_asm.py: Trust But Verify

After the BEQ incident I needed independent verification. Enter Capstone—a multi-architecture disassembly framework with Python bindings. The verification script has three helpers mirroring the encoding formats:

def R(op, rs, rt, rd, sa, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct

def I(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def J(op, target):
    return (op << 26) | (target & 0x03FFFFFF)

These produce the same 32-bit values I hand-computed, but parameterized so I can actually read what each field is. A check() function takes (hex_value, expected_mnemonic, expected_operands) tuples, packs each word, feeds it to Capstone’s MIPS big-endian disassembler, and asserts the output matches. Capstone is completely independent ground truth—different authors, different code, different test suites. If my hex and Capstone agree, the odds of a shared bug are basically zero. After the BEQ incident I ran this religiously after every change. Paranoia earned.

Closing Thoughts

Hand-encoding 34 MIPS instructions took maybe 3 hours, including the BEQ debugging. Installing a cross-assembler would have taken 20 minutes and I wouldn’t have learned shit. The sign-extension gotcha, the delay slots, the J-type target encoding from KSEG1—I know all of it cold now because I had to think about every single bit. Would I recommend this for a larger project? Absolutely not. For a handful of tiny programs on an embedded target with a debugger attached? Most fun you can have with 32 bits.