ARM / AArch64 Assembly

Purpose

Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.

Triggers

  • "How do I read ARM64 assembly output?"
  • "What are the AArch64 registers and calling convention?"
  • "How do I write inline asm for ARM?"
  • "What is the difference between AArch64 and ARM Thumb?"
  • "How do I use NEON intrinsics?"

Workflow

1. Generate ARM assembly

# AArch64 (native or cross-compile)
aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s

# 32-bit ARM Thumb
arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s

# From objdump
aarch64-linux-gnu-objdump -d -S prog

# From GDB on target
(gdb) disassemble /s main
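
To sanity-check the output, a tiny input file (hypothetical add.c) keeps the generated assembly readable:

// add.c
long add3(long a, long b, long c) {
    return a + b + c;
}

With -O2, aarch64-linux-gnu-gcc typically emits something close to:

add3:
    add x0, x0, x1    // arguments arrive in x0, x1, x2
    add x0, x0, x2
    ret               // result returned in x0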

2. AArch64 registers (AAPCS64)

Register   Alias     Role
x0–x7                Arguments 1–8 and return values
x8         xr        Indirect result location (struct return)
x9–x15               Caller-saved temporaries
x16–x17    ip0, ip1  Intra-procedure-call temporaries (used by the linker)
x18        pr        Platform register (reserved on some OSes)
x19–x28              Callee-saved
x29        fp        Frame pointer (callee-saved)
x30        lr        Link register (return address)
sp                   Stack pointer (must be 16-byte aligned at calls)
pc                   Program counter (not directly accessible)
xzr        wzr       Zero register (reads as 0, writes discarded)
v0–v7      q0–q7     FP/SIMD arguments and return values
v8–v15               Callee-saved SIMD (lower 64 bits only)
v16–v31              Caller-saved SIMD temporaries

Width variants: general-purpose registers have x0 (64-bit) and w0 (32-bit; writes to w0 zero-extend into x0). The FP/SIMD register v0 is accessed as q0 (128-bit), d0 (64), s0 (32), h0 (16), or b0 (8).
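
To see the width rule in practice, a minimal sketch (hypothetical file widths.c; exact output varies by compiler version) built with aarch64-linux-gnu-gcc -S -O2:

// widths.c: 32-bit arguments arrive in w0 and w1
unsigned int add_u32(unsigned int a, unsigned int b) {
    return a + b;
}

Typical output, where the 32-bit add writes w0 and implicitly zeroes the upper half of x0:

add_u32:
    add w0, w0, w1
    ret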

3. AAPCS64 calling convention

Integer/pointer arguments: x0–x7
Float/SIMD arguments:      v0–v7
Return values:             x0 (integer), x0+x1 (128-bit), v0 (float/SIMD)
Callee-saved:              x19–x28, x29 (fp), x30 (lr), v8–v15 (lower 64 bits)
Caller-saved:              everything else

The stack pointer must be 16-byte aligned at every function call (any bl or blr) and whenever it is used to access memory.
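
A hedged sketch of a call site that follows these rules (helper is a hypothetical external function): arguments go in x0–x2, the result comes back in x0, and a value that must survive the call is parked in callee-saved x19.

// long caller(long a) { return helper(a, 10, 20) + a; }
caller:
    stp  x29, x30, [sp, #-32]!   // save fp/lr, allocate 32 bytes (sp stays 16-byte aligned)
    mov  x29, sp
    str  x19, [sp, #16]          // x19 is callee-saved, so it survives the call
    mov  x19, x0                 // keep 'a' live across the call
    mov  x1, #10                 // second argument
    mov  x2, #20                 // third argument
    bl   helper                  // first argument already in x0; result returns in x0
    add  x0, x0, x19             // helper(a, 10, 20) + a
    ldr  x19, [sp, #16]
    ldp  x29, x30, [sp], #32
    ret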

4. Common AArch64 instructions

Instruction                      Effect
mov  x0, x1                      Copy register
mov  x0, #42                     Load immediate
movz x0, #0x1234, lsl #16        Move 16-bit immediate shifted left 16, zero the other bits
movk x0, #0xabcd                 Move with keep (update 16 bits, preserve the rest)
ldr  x0, [x1]                    Load 64 bits from the address in x1
ldr  x0, [x1, #8]                Load from x1+8
str  x0, [x1, #8]                Store x0 to x1+8
ldp  x0, x1, [sp, #16]           Load pair (two registers at once)
stp  x29, x30, [sp, #-16]!       Store pair, pre-decrement sp
add  x0, x1, x2                  x0 = x1 + x2
add  x0, x1, #8                  x0 = x1 + 8
sub  x0, x1, x2                  x0 = x1 - x2
mul  x0, x1, x2                  x0 = x1 * x2
sdiv x0, x1, x2                  Signed divide
udiv x0, x1, x2                  Unsigned divide
cmp  x0, x1                      Set flags for x0 - x1
cbz  x0, label                   Branch if x0 == 0
cbnz x0, label                   Branch if x0 != 0
bl   func                        Branch with link (call)
blr  x0                          Branch with link to the address in x0
ret                              Return (branch to x30)
ret  x0                          Return to the address in x0
adrp x0, symbol                  PC-relative address of the 4 KiB page containing symbol
add  x0, x0, :lo12:symbol        Add the low 12 bits of the symbol's offset
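
Two patterns from the table come up constantly in compiler output: building a wide constant with movz/movk, and addressing a global with adrp plus a :lo12: offset. A hedged sketch (counter is a hypothetical global symbol):

// Materialize the 64-bit constant 0x1122334455667788 in x0
movz x0, #0x7788                  // bits 0-15, other bits zeroed
movk x0, #0x5566, lsl #16         // bits 16-31, rest preserved
movk x0, #0x3344, lsl #32         // bits 32-47
movk x0, #0x1122, lsl #48         // bits 48-63

// Load a 64-bit global through a PC-relative page address
adrp x1, counter                  // address of the 4 KiB page containing counter
ldr  x2, [x1, :lo12:counter]      // apply the within-page offset in the load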

5. Typical function prologue/epilogue

// Non-leaf function
stp  x29, x30, [sp, #-32]!   // save fp, lr; allocate 32 bytes
mov  x29, sp                  // set frame pointer
stp  x19, x20, [sp, #16]     // save callee-saved registers
// ... body ...
ldp  x19, x20, [sp, #16]     // restore
ldp  x29, x30, [sp], #32     // restore fp, lr; deallocate
ret

// Leaf function (no calls, no callee-saved registers needed)
// AAPCS64 defines no red zone, so locals still require an sp adjustment
sub  sp, sp, #16             // allocate locals
// ... body ...
add  sp, sp, #16
ret
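
For comparison, a small non-leaf function like the sketch below (hypothetical names) is the kind of source that produces the stp/ldp pattern above: x29/x30 must be saved because it makes calls, and values that survive a call are parked in callee-saved registers such as x19/x20.

long twice(long x);                // assumed external helper

long sum_twice(long a, long b) {
    return twice(a) + twice(b);    // 'b' and the first result must survive
                                   // calls, so they live in callee-saved
                                   // registers
}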

6. Inline assembly (GCC/Clang)

#include <stdint.h>    // for uint64_t used by read_cntvct below

// Barrier
__asm__ volatile ("dmb ish" ::: "memory");

// Load acquire
static inline int load_acquire(volatile int *p) {
    int val;
    __asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
    return val;
}

// Store release
static inline void store_release(volatile int *p, int val) {
    __asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}

// Read system counter
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    __asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
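
As a usage sketch built on the helpers above, the classic message-passing pattern publishes data with a release store and consumes it with an acquire load (the variables are assumed to be shared between two threads):

static int data;
static volatile int ready;

void producer(void) {
    data = 42;                     // plain store
    store_release(&ready, 1);      // release: 'data' becomes visible before 'ready'
}

int consumer(void) {
    while (load_acquire(&ready) == 0)
        ;                          // spin until the flag is published
    return data;                   // acquire guarantees data == 42 is seen
}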

AArch64-specific constraints:

  • "Q" — memory operand suitable for exclusive/acquire/release instructions
  • "r" — any general-purpose register
  • "w" — any FP/SIMD register

7. NEON SIMD intrinsics

#include <arm_neon.h>

// Add 4 floats at once
float32x4_t a = vld1q_f32(arr_a);   // load 4 floats
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);

// Horizontal sum
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);

Naming convention: v<op>[q]_<type>

  • q suffix: 128-bit (quad) vector
  • _f32: float32, _s32: int32, _u8: uint8, etc.
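
Putting the intrinsics together, a minimal sketch of a complete vectorized add (hypothetical function name; the scalar tail handles lengths that are not a multiple of 4):

#include <arm_neon.h>
#include <stddef.h>

// c[i] = a[i] + b[i] for i in [0, n)
void add_f32(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {             // 4 floats per NEON iteration
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb));
    }
    for (; i < n; i++)                       // scalar tail: remaining 0-3 elements
        c[i] = a[i] + b[i];
}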

For a register reference, see references/reference.md.

Related skills

  • Use skills/low-level-programming/assembly-x86 for x86-64 assembly
  • Use skills/compilers/cross-gcc for cross-compilation toolchain
  • Use skills/debuggers/gdb for debugging ARM code with gdbserver