code-obfuscation-deobfuscation
SKILL: Code Obfuscation & Deobfuscation — Expert Analysis Playbook
AI LOAD INSTRUCTION: Expert techniques for identifying, classifying, and defeating code obfuscation in native binaries. Covers junk code, opaque predicates, SMC, control flow flattening, movfuscator, VM protectors (VMProtect/Themida/Code Virtualizer), string encryption, import hiding, and anti-disassembly tricks. Base models often conflate packing with obfuscation and miss the distinction between static and dynamic deobfuscation strategies.
0. RELATED ROUTING
- anti-debugging-techniques when the obfuscated binary also has anti-debug layers
- symbolic-execution-tools when using angr/Z3 for automated deobfuscation
- vm-and-bytecode-reverse for deep VM protector bytecode analysis
Quick identification picks
| Symptom in IDA/Ghidra | Likely Obfuscation | Start With |
|---|---|---|
| Flat CFG, single giant switch | Control flow flattening | Symbolic execution to recover CFG |
Only mov instructions |
movfuscator | demovfuscation / trace-based lifting |
| pushad/pushfd → VM entry | VM protector | Handler table extraction |
| XOR loop before code execution | SMC / string encryption | Dynamic analysis, breakpoint after decode |
| Impossible conditions (opaque predicates) | Junk code insertion | Pattern-based removal |
| All strings unreadable | String encryption | Hook decryption routine, or emulate |
| No imports in IAT | Import hiding | Trace GetProcAddress / hash resolution |
1. JUNK CODE & OPAQUE PREDICATES
1.1 Junk Code Insertion
Dead code that never affects program output, added to increase analysis time.
Identification:
- Instructions that write to registers/memory never read afterward
- Function calls whose return values are discarded and have no side effects
- Loops with invariant bounds that compute unused results
Removal strategy:
- Compute def-use chains (IDA/Ghidra data flow analysis)
- Mark instructions with no downstream use as dead
- Verify removal doesn't change program behavior (trace comparison)
1.2 Opaque Predicates
Conditional branches where the condition is always true or always false, but this is non-obvious.
| Type | Example | Always Evaluates To |
|---|---|---|
| Arithmetic | x² ≥ 0 |
True |
| Number theory | x*(x+1) % 2 == 0 |
True (product of consecutive ints) |
| Pointer-based | ptr == ptr after aliasing |
True |
| Hash-based | CRC32(constant) == known_value |
True |
Deobfuscation:
- Abstract interpretation: prove the condition is constant
- Symbolic execution: Z3 proves
∀x: predicate(x) = True - Pattern matching: recognize known opaque predicate families
- Dynamic: trace and observe the branch is never taken / always taken
import z3
x = z3.BitVec('x', 32)
s = z3.Solver()
s.add(x * (x + 1) % 2 != 0)
print(s.check()) # unsat → always true
2. SELF-MODIFYING CODE (SMC)
Runtime code patching: encrypted code is decrypted just before execution.
2.1 XOR Decryption Loop (Most Common)
lea esi, [encrypted_code]
mov ecx, code_length
mov al, xor_key
decrypt_loop:
xor byte [esi], al
inc esi
loop decrypt_loop
jmp encrypted_code ; now decrypted
2.2 Analysis Strategy
1. Identify the decryption routine (look for XOR/ADD/SUB in loops writing to .text)
2. Set breakpoint AFTER the loop completes
3. At breakpoint: dump the decrypted memory region
4. Re-analyze the dumped code in IDA/Ghidra
5. For multi-layer: repeat for each decryption stage
2.3 Automated Unpacking via Emulation
from unicorn import *
from unicorn.x86_const import *
mu = Uc(UC_ARCH_X86, UC_MODE_32)
mu.mem_map(0x400000, 0x10000)
mu.mem_write(0x400000, binary_code)
mu.emu_start(decrypt_entry, decrypt_end)
decrypted = mu.mem_read(code_start, code_length)
3. CONTROL FLOW FLATTENING (CFF)
3.1 Structure
Original sequential blocks are transformed into a dispatcher loop:
Original: A → B → C → D
Flattened: ┌──────────────────┐
│ dispatcher │
│ switch(state) │◄─────┐
├──────────────────┤ │
│ case 1: block A │──────┤
│ case 2: block B │──────┤
│ case 3: block C │──────┤
│ case 4: block D │──────┘
└──────────────────┘
Each block sets state = next_state before jumping back to the dispatcher.
3.2 Recovery Techniques
| Technique | Tool | Effectiveness |
|---|---|---|
| Symbolic execution | angr, Triton, miasm | High — traces all state transitions |
| Trace-based recovery | Pin/DynamoRIO trace → reconstruct CFG | Medium — covers executed paths only |
| Pattern matching | Custom IDA/Ghidra script | Medium — works for known flatteners |
| D-810 (IDA plugin) | IDA Pro | High — specifically designed for CFF |
3.3 Symbolic Deflattening (angr approach)
import angr, claripy
proj = angr.Project('./obfuscated')
cfg = proj.analyses.CFGFast()
# Find dispatcher block (highest in-degree basic block)
dispatcher = max(cfg.graph.nodes(), key=lambda n: cfg.graph.in_degree(n))
# For each case block, symbolically determine successor
for block in case_blocks:
state = proj.factory.blank_state(addr=block.addr)
# ... solve state variable to find real successor
4. MOVFUSCATOR
4.1 Concept
All computation reduced to mov instructions only (Turing-complete via memory-mapped computation tables). Created by Christopher Domas.
4.2 Identification
- Function contains only
movinstructions (no add, sub, xor, jmp, call) - Large lookup tables in data section
- Memory-mapped flag registers
4.3 Demovfuscation
| Approach | Description |
|---|---|
| demovfuscator (tool) | Static analysis, recovers original operations from mov patterns |
| Trace + taint analysis | Run with Pin/DynamoRIO, taint inputs, observe computation |
| Symbolic execution | Treat entire function as constraint system |
5. VM PROTECTION (VMProtect / Themida / Code Virtualizer)
5.1 VM Architecture
Protected code → bytecode compiler → custom bytecode
Runtime: VM entry (pushad/pushfd) → fetch → decode → execute → VM exit (popad/popfd)
5.2 VM Entry Point Identification
; Typical VMProtect entry
pushad ; save all registers
pushfd ; save flags
mov ebp, esp ; VM stack frame
sub esp, VM_LOCALS_SIZE ; allocate VM context
mov esi, bytecode_addr ; bytecode instruction pointer
jmp vm_dispatcher ; enter VM loop
5.3 Handler Table Extraction
1. Find dispatcher (large switch or indirect jump via table)
2. Each case/entry = one VM handler (implements one VM opcode)
3. Map handler addresses to operations by analyzing each handler:
- Handler reads operand from bytecode stream (esi)
- Performs operation on VM registers/stack
- Advances bytecode pointer
- Returns to dispatcher
5.4 Devirtualization Approaches
| Method | Description | Tool |
|---|---|---|
| Manual handler mapping | Reverse each handler, build ISA spec | IDA + scripting |
| Trace recording | Record all handler executions, reconstruct program | REVEN, Pin |
| Symbolic lifting | Symbolically execute handlers, lift to IR | Triton, miasm |
| Pattern matching | Match handler patterns to known VM families | Custom scripts |
5.5 VMProtect Specifics
- Uses opaque predicates in dispatcher
- Handler mutation: same opcode, different handler code per build
- Multiple VM layers (VM inside VM)
- Integrates anti-debug and integrity checks
6. STRING ENCRYPTION
6.1 Common Patterns
| Pattern | Example | Recovery |
|---|---|---|
| XOR loop | for (i=0; i<len; i++) s[i] ^= key; |
Hook or emulate XOR function |
| Stack strings | mov [esp+0], 'H'; mov [esp+1], 'e'; ... |
IDA FLIRT / Ghidra script to reassemble |
| RC4 encrypted | Encrypted blob + RC4 key in binary | Extract key, decrypt offline |
| AES encrypted | Encrypted blob + AES key derived at runtime | Hook after decryption |
| Custom encoding | Base64 + XOR + reverse | Trace the decode function, replicate |
6.2 Automated String Decryption
# Ghidra script: find XOR decryption calls, emulate them
from ghidra.program.model.symbol import SourceType
decrypt_func = getFunction("decrypt_string")
refs = getReferencesTo(decrypt_func.getEntryPoint())
for ref in refs:
call_addr = ref.getFromAddress()
# extract arguments (encrypted buffer ptr, key, length)
# emulate decryption, add comment with plaintext
7. IMPORT HIDING
7.1 GetProcAddress + Hash Lookup
FARPROC resolve(DWORD hash) {
// Walk PEB → LDR → InMemoryOrderModuleList
// For each DLL, walk export table
// Hash each export name, compare with target hash
// Return matching function pointer
}
7.2 Recovery
- Identify the hash algorithm (common: CRC32, djb2, ROR13+ADD)
- Compute hashes for all known API names
- Build hash → API name lookup table
- Annotate resolved calls in IDA/Ghidra
7.3 Common Hash Algorithms
| Name | Algorithm | Used By |
|---|---|---|
| ROR13 | hash = (hash >> 13 | hash << 19) + char |
Metasploit shellcode |
| djb2 | hash = hash * 33 + char |
Various malware |
| CRC32 | Standard CRC32 of function name | Sophisticated packers |
| FNV-1a | hash = (hash ^ char) * 0x01000193 |
Modern malware |
8. ANTI-DISASSEMBLY TRICKS
8.1 Techniques
| Trick | Mechanism | Fix |
|---|---|---|
| Overlapping instructions | jmp $+2; db 0xE8 (fake call prefix) |
Manual re-analysis from correct offset |
| Misaligned jumps | Jump into middle of multi-byte instruction | Force IDA to re-analyze at target |
| Conditional jump pair | jz $+5; jnz $+3 (always jumps, confuses linear disasm) |
Convert to unconditional jmp |
| Return address manipulation | push addr; ret instead of jmp addr |
Recognize push+ret as jump |
| Exception-based flow | Trigger exception, real code in handler | Analyze exception handler chain |
| Call + add [esp] | call $+5; add [esp], N; ret (computed jump) |
Calculate actual target |
8.2 IDA Fixes
Right-click → Undefine (U)
Right-click → Code (C) at correct offset
Edit → Patch → Assemble (for permanent fix)
9. DECISION TREE
Obfuscated binary — how to approach?
│
├─ Can you run it?
│ ├─ Yes → Dynamic analysis first
│ │ ├─ Set BP on interesting APIs (file, network, crypto)
│ │ ├─ Trace execution to understand real behavior
│ │ └─ Dump decrypted code/strings at runtime
│ │
│ └─ No (embedded/firmware/exotic arch) → Static only
│ └─ Identify obfuscation type from patterns below
│
├─ What does the code look like?
│ │
│ ├─ Giant flat switch/dispatcher loop?
│ │ ├─ State variable drives control flow → CFF
│ │ │ └─ Use D-810 or symbolic deflattening
│ │ └─ Bytecode fetch-decode-execute → VM protection
│ │ └─ Extract handlers, build disassembler
│ │
│ ├─ Only mov instructions?
│ │ └─ movfuscator → demovfuscator tool
│ │
│ ├─ XOR/ADD loop writing to .text section?
│ │ └─ SMC → breakpoint after decode, dump
│ │
│ ├─ Impossible conditions in branches?
│ │ └─ Opaque predicates → Z3 proving or pattern removal
│ │
│ ├─ Disassembly looks wrong / functions overlap?
│ │ └─ Anti-disassembly → manual re-analysis at correct offsets
│ │
│ ├─ No readable strings?
│ │ └─ String encryption → hook decrypt function or emulate
│ │
│ ├─ No imports in IAT?
│ │ └─ Import hiding → identify hash, build lookup table
│ │
│ └─ pushad/pushfd → complex code → popad/popfd?
│ └─ VM protector entry/exit → full VM analysis
│
└─ What tool to use?
├─ Known protector (VMProtect/Themida) → specific deprotection guide
├─ Custom obfuscation → combine: IDA scripting + Triton + manual
├─ CTF challenge → angr symbolic execution often fastest
└─ Malware analysis → dynamic (debugger + API monitor) first
10. TOOLBOX
| Tool | Purpose | Best For |
|---|---|---|
| IDA Pro + Hex-Rays | Disassembly, decompilation, scripting | All-around analysis |
| Ghidra | Free alternative with scripting (Java/Python) | Budget-friendly RE |
| D-810 (IDA plugin) | Automated CFF deflattening | OLLVM-style obfuscation |
| miasm | IR-based analysis framework | Symbolic deobfuscation |
| Triton | Dynamic symbolic execution | Opaque predicate solving, CFF |
| REVEN | Full-system trace recording and replay | VM protector analysis |
| demovfuscator | movfuscator reversal | mov-only binaries |
| x64dbg + plugins | Dynamic analysis with scripting | Windows RE |
| Unicorn Engine | CPU emulation | SMC unpacking, shellcode |
| Capstone | Disassembly library | Custom tooling |
| IDA FLIRT | Function signature matching | Identify library code in stripped binaries |
| Binary Ninja | Alternative disassembler with MLIL/HLIL | Automated analysis |