1. Required Knowledge to Write in Assembly
1. Application Binary Interface
→ (ABI): Function/OS interoperation
(a) Argument passing
(b) Stack handling
(c) Register conventions
2. Instruction Set Architecture
→ (ISA): the ISA actually defines hex inst formats, but most assemblers use the suggested mnemonics
→ These are the instructions that you must build programs out of
3. Registers/flags
4. Assembler used (gas):
(a) Assembler directives (prefixed by .)
(b) Operand order (dest, src1, src2)
(c) Const identifier (#)
5. For ARM, what MODE you are in
• Has Thumb mode, where inst are 16 bits (not covered)
• Has mixed mode, where inst are 16 or 32 bits (not covered)
• Has 32-bit ARM mode (everything we do)
2. Further Resources
• ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition
– http://www.ecs.syr.edu/faculty/yin/teaching/CIS700-sp11/arm_architecture_reference_manu
• ATLAS assembly page (from [links] on class homepage):
– http://math-atlas.sourceforge.net/devel/assembly/
• ATLAS architecture page (from [links] on class homepage):
– http://math-atlas.sourceforge.net/devel/arch/
3. Linux/ARM Calling Sequence and Stack Frame
• Stack grows downward in memory
• Caller puts callees’ args in its frame
• Frame 8-byte (64 bit) aligned (can be 4-byte aligned for leaf node)
• If callee needs no scratch space, can leave SP unmodified
• Otherwise, subtract frame size from SP, keeping 4-byte aligned
Stack frame passed to callee (high addresses at top):
  Caller’s frame
  last overflow arg
  ...
  1st overflow arg   <- SP
• Float/dbl passed in iregs, then overflow to stack
• Doubles never partially in iregs, kept 8-byte aligned on stack
4. ARM has 16 (14) Integer Registers
REGISTER  USAGE                        CALLEE SAVE
r0-r1     para0/1, return value        NO
r2-r3     para2/3                      NO
r4-r11    General                      YES
r12       (IP) scratch reg             NO
r13       (SP) stack ptr               YES
r14       (LR) Link register (ret @)   NO
r15       (PC) program counter         NO
CPSR      status register              NO
• IP used by linker, but not within routine
• Jump back to LR at end of func
5. ARM Floating Point Registers
• FPU is optional, diff versions have different # of regs
– VFP-v2 has 32 floats (s0-s31) and 16 doubles (d0-d15)
– VFP-v3 has 32 of each (s0-s31 / d0-d31)
– SIMD uses q0-q15
• s0-s15 (d0-d7, q0-q3) are caller-saved (scratch)
• s16-s31 (d8-d15, q4-q7) are callee-saved
• d16-d31 (q8-q15) are caller-saved (scratch)
• FPSCR : bits 28-31 are condition bits, 8-12 are exception trap-enable bits, 22-23 are rounding mode bits, 24 controls flush to zero, 16-18 are length bits, 20-21 are stride bits
– All except condition bits are callee saved.
• Float/dbl passed in iregs, then overflow to stack
• Doubles never partially in iregs, kept 8-byte aligned on stack
6. GNU/Linux/ARM Integer Overview
• Three operand assembler: op<pred>[s] <dest> <src1> <shft>
– pred: EQ, NE, GE, LT, GT, LE, LS, AL, CS, CC, MI, PL, VS, VC, HI
– s suffix means update cond codes
• dest,src must be registers (shft also reg for fops)
All iops take shft, which can be:
• An 8-bit constant rotated by
2*immediate
• A register with or w/o rotation
(any # of bits)
• ADD R8, R5, R4, LSL #2
– R8 = R5 + 4*R4
• ADD R8, R5, R4, LSR #3
– R8 = R5 + R4/8
• add r0, r0, r1, LSL r2
– R0 += R1 << r2
• add r0, r2, r3, LSL r4
– R0 = R2 + (R3 << R4)
Shift meanings
• LSL : Logical Shift Left, 0s filled
in vacated bits
• LSR : Logical Shift Right, 0s filled in vacated bits
• ASR : Arithmetic Shift Right, vacated positions filled with the unchanged sign bit
• ROR : ROtate Right, bits shifted off one end go into the other
• RRX : 33-bit rotate (we’re not
covering it)
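The shifter operand can be modeled in C to check the examples above. This is only a sketch: the function names are made up for illustration, shift amounts are assumed to lie in 1..31, and the 8-bit immediate is assumed to be rotated right by twice a 4-bit rotate field.

```c
#include <stdint.h>

/* Hypothetical helper names; each models one shift kind on a 32-bit value.
 * Shift amounts n are assumed to be in the range 1..31. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

uint32_t asr(uint32_t x, unsigned n)
{
    /* Copy the sign bit into the vacated positions explicitly, so the
     * result does not depend on how the compiler shifts signed values. */
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

uint32_t ror(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Immediate form: 8-bit constant rotated right by 2*rot (rot is 0..15). */
uint32_t arm_imm(uint32_t imm8, unsigned rot)
{
    unsigned n = 2u * (rot & 0xFu);
    imm8 &= 0xFFu;
    return n ? ror(imm8, n) : imm8;
}

/* ADD rd, rs, rm, LSL #n  ->  rd = rs + (rm << n) */
uint32_t add_lsl(uint32_t rs, uint32_t rm, unsigned n)
{
    return rs + lsl(rm, n);
}
```

For instance, add_lsl(r5, r4, 2) mirrors ADD R8, R5, R4, LSL #2, which computes R5 + 4*R4.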
7. ARM Integer Load Operations
Mnm   Operands              Action
Simple loads
ldr   rd, [rs]              rd = *rs
ldr   rd, [rs, #±imm12]     rd = *(rs ± imm12)
ldr   rd, [rs1, ±rs2]       rd = *(rs1 ± rs2)
ldr   rd, [rs1, ±shft]      rd = *(rs1 ± shft)
Pre-increment loads
ldr   rd, [rs, #±imm12]!    rs = rs ± imm12; rd = *rs
ldr   rd, [rs1, ±rs2]!      rs1 = rs1 ± rs2; rd = *rs1
ldr   rd, [rs1, ±shft]!     rs1 = rs1 ± shft; rd = *rs1
Post-increment loads
ldr   rd, [rs], #±imm12     rd = *rs; rs = rs ± imm12
ldr   rd, [rs1], ±rs2       rd = *rs1; rs1 = rs1 ± rs2
ldr   rd, [rs1], ±shft      rd = *rs1; rs1 = rs1 ± shft
• ldr : LoaD Register (32 bits)
• ldrb : same for loading single byte (8 bits)
• suffix with pred for predicated operation
• If predicate not true, val not loaded and rs1 not updated
• imm12 : 0 - 4095
• shft is all options shown on slide 6
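The difference between the simple, pre-increment, and post-increment forms can be sketched in C. The toy memory array and the function names below are assumptions made for illustration; addresses are byte offsets into that array and are assumed to be multiples of 4.

```c
#include <stdint.h>

/* Toy word memory; an "address" is a byte offset into this array. */
static uint32_t mem[8] = {10, 20, 30, 40, 50, 60, 70, 80};

static uint32_t load_word(uint32_t addr) { return mem[addr / 4]; }

/* ldr rd, [rs, #off]   : rd = *(rs + off); rs is unchanged */
uint32_t ldr_simple(uint32_t *rs, int32_t off)
{
    return load_word(*rs + (uint32_t)off);
}

/* ldr rd, [rs, #off]!  : rs = rs + off; rd = *rs */
uint32_t ldr_pre(uint32_t *rs, int32_t off)
{
    *rs += (uint32_t)off;
    return load_word(*rs);
}

/* ldr rd, [rs], #off   : rd = *rs; rs = rs + off */
uint32_t ldr_post(uint32_t *rs, int32_t off)
{
    uint32_t rd = load_word(*rs);
    *rs += (uint32_t)off;
    return rd;
}
```

The three functions read the same word when given suitable offsets; what differs is whether and when the base register is updated.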
8. ARM Integer Store Operations
Mnm   Operands              Action
Simple stores
str   rs, [ra]              *ra = rs
str   rs, [ra, #±imm12]     *(ra ± imm12) = rs
str   rs, [ra1, ±ra2]       *(ra1 ± ra2) = rs
str   rs, [ra1, ±shft]      *(ra1 ± shft) = rs
Pre-increment stores
str   rs, [ra, #±imm12]!    ra = ra ± imm12; *ra = rs
str   rs, [ra1, ±ra2]!      ra1 = ra1 ± ra2; *ra1 = rs
str   rs, [ra1, ±shft]!     ra1 = ra1 ± shft; *ra1 = rs
Post-increment stores
str   rs, [ra], #±imm12     *ra = rs; ra = ra ± imm12
str   rs, [ra1], ±ra2       *ra1 = rs; ra1 = ra1 ± ra2
str   rs, [ra1], ±shft      *ra1 = rs; ra1 = ra1 ± shft
9. ARM LD/ST multiple
ARM has ability to load/store any subset or all registers at once:
• stm[IB,IA,DB,DA] ra[!], {reg list}
• ldm[IB,IA,DB,DA] ra[!], {reg list}
register list is an increasing set of registers, specified individually or by
ranges:
• {r2,r5-r11,r14} // ld/st r2,r5,r6,r7,r8,r9,r10,r11,r14
• For each reg in list, size(RL) = 4*nreg
• Low reg # is stored to low part of memory
The suffixes indicate how to form the address and what to store if ! is
set. EA will be the lowest address accessed, while UA will be the address that is
written to ra if it has the ! suffix:
suff  Meaning           Addresses
IB    Increment Before  EA = ra+4; UA = ra+size(RL)
IA    Increment After   EA = ra; UA = ra+size(RL)
DB    Decrement Before  EA = UA = ra-size(RL)
DA    Decrement After   EA = ra-size(RL)+4; UA = ra-size(RL)
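The address table can be written out as a small C function. This is a sketch; the enum, struct, and function names are invented for illustration and are not part of any assembler or library.

```c
#include <stdint.h>

enum ldm_mode { LDM_IB, LDM_IA, LDM_DB, LDM_DA };

/* ea: lowest address accessed; ua: value written back to ra when ! is used */
struct ldm_addr { uint32_t ea; uint32_t ua; };

struct ldm_addr ldm_addresses(enum ldm_mode mode, uint32_t ra, unsigned nreg)
{
    uint32_t size = 4u * nreg;          /* size(RL) = 4*nreg */
    struct ldm_addr a;

    switch (mode) {
    case LDM_IB: a.ea = ra + 4;        a.ua = ra + size; break;
    case LDM_IA: a.ea = ra;            a.ua = ra + size; break;
    case LDM_DB: a.ea = ra - size;     a.ua = ra - size; break;
    default:     a.ea = ra - size + 4; a.ua = ra - size; break; /* LDM_DA */
    }
    return a;
}
```

With SP = 0x1000, a stmDB SP! of eight registers gives EA = UA = 0xFE0, and the matching ldmIA SP! from 0xFE0 writes 0x1000 back to SP.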
10. LD/ST Examples: Saving and restoring Integer Registers
Saving all callee-saved registers and restoring with 1 inst:
PROLOGUE:
stmDB SP!, {r4-r11}     // SP -= # of regs*4, save callee-saved iregs r4-r11
....
DONE:
ldmIA SP!, {r4-r11}     // restore callee-saved registers; writeback restores SP (r13)
bx lr                   // jump to link reg, restoring PC (R15)
Example of mixed operation:
PROLOGUE:
str r6, [SP, #-4]!      // SP -= 4; *SP = r6
str r5, [SP, #-4]!      // SP -= 4; *SP = r5
sub SP, SP, #4          // SP -= 4
str r4, [SP]            // *SP = r4
DONE:
ldmIA SP, {r4,r5}       // restore r4 & r5, leave SP unchanged
ldr r6, [SP, #8]        // restore r6, SP unchanged
add SP, SP, #12         // restore SP
bx lr                   // jump to link reg, restoring PC (R15)
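The mixed save/restore sequence can be traced with a toy C model of three registers and a small stack. The struct layout, the 16-word stack, and the function names are assumptions made for this sketch; SP is a byte offset into the stack array.

```c
#include <stdint.h>

/* Toy machine state: three callee-saved registers, SP, and a small stack. */
struct cpu {
    uint32_t r4, r5, r6;
    uint32_t sp;            /* byte offset into stack[], grows downward */
    uint32_t stack[16];
};

/* Models: str v, [SP, #-4]!  (pre-decrement SP, then store) */
static void push(struct cpu *c, uint32_t v)
{
    c->sp -= 4;
    c->stack[c->sp / 4] = v;
}

void prologue(struct cpu *c)
{
    push(c, c->r6);
    push(c, c->r5);
    push(c, c->r4);         /* sub SP,SP,#4 followed by str r4,[SP] */
}

void epilogue(struct cpu *c)
{
    /* ldmIA SP, {r4,r5}: low register number comes from the low address */
    c->r4 = c->stack[c->sp / 4];
    c->r5 = c->stack[c->sp / 4 + 1];
    /* ldr r6, [SP, #8] */
    c->r6 = c->stack[c->sp / 4 + 2];
    /* add SP, SP, #12 */
    c->sp += 12;
}
```

Pushing in the order r6, r5, r4 is what places r4 at the lowest address, which is the layout the load-multiple expects.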
11. Common Integer Arithmetic Operations
Mnm    Operands              Action
add    rd, rs, shft          rd = rs + shft
sub    rd, rs, shft          rd = rs - shft
rsb    rd, rs, shft          rd = shft - rs
mul    rd, rs1, rs2          rd = rs1 * rs2
mla    rd, rs1, rs2, rs3     rd = rs1*rs2 + rs3
umull  rdlo, rdhi, rs1, rs2  (rdhi,rdlo) = rs1*rs2 (unsigned)
smull  rdlo, rdhi, rs1, rs2  (rdhi,rdlo) = rs1*rs2 (signed)
umlal  rdlo, rdhi, rs1, rs2  (rdhi,rdlo) += rs1*rs2 (unsigned)
smlal  rdlo, rdhi, rs1, rs2  (rdhi,rdlo) += rs1*rs2 (signed)
• No integer division on many ARMv7-A cores (SDIV/UDIV are optional in ARMv7-A)
• can suffix for predication
• suffixing with ’S’ makes them update the condition codes
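The long multiplies are easiest to read as 64-bit C arithmetic. The sketch below uses illustrative function names and assumes the accumulate form adds the product to the full 64-bit (rdhi,rdlo) pair.

```c
#include <stdint.h>

/* umull: (rdhi,rdlo) = rs1*rs2, unsigned 32x32 -> 64 */
uint64_t umull(uint32_t rs1, uint32_t rs2)
{
    return (uint64_t)rs1 * rs2;
}

/* smull: (rdhi,rdlo) = rs1*rs2, signed 32x32 -> 64 */
int64_t smull(int32_t rs1, int32_t rs2)
{
    return (int64_t)rs1 * rs2;
}

/* umlal: (rdhi,rdlo) += rs1*rs2, with acc standing for the register pair */
uint64_t umlal(uint64_t acc, uint32_t rs1, uint32_t rs2)
{
    return acc + (uint64_t)rs1 * rs2;
}

/* rsb: reverse subtract, rd = shft - rs */
uint32_t rsb(uint32_t rs, uint32_t shft)
{
    return shft - rs;
}
```

Casting one operand to 64 bits before multiplying is what keeps the high half of the product; a plain 32-bit multiply, like mul, keeps only the low 32 bits.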
12. Common Bit-Level Operations
Mnemonic  Operands      Action
mov       rd, shft      rd = shft
mvn       rd, shft      rd = ~(shft)
and       rd, rs, shft  rd = rs & (shft)
orr       rd, rs, shft  rd = rs | (shft)
eor       rd, rs, shft  rd = rs ^ (shft), (if shft=rs, zero!)
bic       rd, rs, shft  rd = rs & ~(shft)
clz       rd, rs        rd = # of leading zeros (most sig bits) in rs
• can suffix for predication
• suffixing with ’S’ makes them update the condition codes (inc mov)
13. ARM Integer Condition Codes
• Condition codes signalled in 4 most sig bits of the current program status register (CPSR)
• Can be set by most integer ops with ‘S’ suffix
Condition flag bits explanation:
• N (bit 31): set to sign bit of result
• Z (bit 30): result of op is zero
• C (bit 29): Carry bit; has several cases:
1. For ADD or CMN, 1 if add produces a carry (unsigned overflow)
2. For SUB or CMP, C is set to 0 if sub produces a borrow (unsigned underflow), else 1.
3. For most other inst, set to last bit shifted out of the value by the shifter
• V (bit 28): set to 1 if signed overflow occurred in add or sub
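The flag rules for add and subtract can be checked with a short C model. The struct and function names are invented for this sketch; it covers only the ADD/CMN and SUB/CMP cases, not the shifter carry-out case.

```c
#include <stdint.h>

struct nzcv { unsigned n, z, c, v; };

/* Flags as set by ADDS/CMN for a + b */
struct nzcv flags_adds(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    struct nzcv f;
    f.n = r >> 31;                              /* sign bit of result */
    f.z = (r == 0);
    f.c = (r < a);                              /* carry out = unsigned overflow */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;    /* same-sign inputs, other-sign result */
    return f;
}

/* Flags as set by SUBS/CMP for a - b */
struct nzcv flags_subs(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    struct nzcv f;
    f.n = r >> 31;
    f.z = (r == 0);
    f.c = (a >= b);                             /* C = 0 on borrow, else 1 */
    f.v = (((a ^ b) & (a ^ r)) >> 31) & 1u;     /* different-sign inputs, sign flips */
    return f;
}
```

Note that C after a subtract is the inverse of a borrow, so comparing 3 with 5 leaves C = 0 while comparing 5 with 5 leaves C = 1.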
14. Condition code/predicate mnemonic
pred   Meaning                             Flag test
EQ     equal                               Z=1
NE     not equal                           Z=0
CS/HS  Carry set/unsigned higher or same   C=1
CC/LO  Carry clear/unsigned lower          C=0
MI     MInus/negative                      N=1
PL     PLus/positive or zero               N=0
VS     Overflow (V Set)                    V=1
VC     no overflow (V Clear)               V=0
HI     Unsigned higher                     C=1 and Z=0
LS     Unsigned lower or same              C=0 or Z=1
GE     Signed greater than or equal        N==V
LT     Signed less than                    N!=V
GT     Signed greater than                 Z=0 and N==V
LE     Signed less than or equal           Z=1 or N!=V
AL     always                              ignored
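The predicate table translates directly into a C switch. The enum and function name below are illustrative only; the flags are passed as 0/1 integers.

```c
enum cond { EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL };

/* Returns nonzero if predicate p holds for the given N, Z, C, V flag values. */
int cond_holds(enum cond p, int n, int z, int c, int v)
{
    switch (p) {
    case EQ: return z;
    case NE: return !z;
    case CS: return c;               /* also written HS */
    case CC: return !c;              /* also written LO */
    case MI: return n;
    case PL: return !n;
    case VS: return v;
    case VC: return !v;
    case HI: return c && !z;
    case LS: return !c || z;
    case GE: return n == v;
    case LT: return n != v;
    case GT: return !z && n == v;
    case LE: return z || n != v;
    default: return 1;               /* AL: always */
    }
}
```

After comparing 3 with 5 the flags are N=1, Z=0, C=0, V=0, so LT and CC/LO hold while GE and HI do not.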
15. Common ARM Comparison and Branch Instructions
Mnemonic  Operands  Action
cmp       rs, shft  Set CC as if rs - shft
cmn       rs, shft  Set CC as if rs + shft
tst       rs, shft  Set CC as if rs & shft
teq       rs, shft  Set CC as if rs ^ shft
Mnemonic  Operands  Action
B         label     jump to label
BL        label     R14 (link reg) = next inst; jump to label
BX        rs        jump to (rs & 0xFFFFFFFE); if low bit of rs is 0, ARM mode, else THUMB mode
BLX       addr      Not covered (jump to thumb func)
• Can do comparison early, branch later (no intervening iops may set CC)
• All branches predicated like every other inst using suffixes of slide 14
• Can return from func called with BL by MOV PC,R14
16. ARM Floating Point Introduction (VFP)
• Almost completely IEEE compliant
• Has logical vector through banks, we won’t cover
• floats in registers s0-s31
• doubles in regs d0-d15 (overlapped with s0-s31), and sometimes d16-d31
• Inst suffixed with ‘s’ do floats, ’d’ handle doubles
• After precision suffix fpinst take usual predicate suffixes which use iCC
• Has three new system registers:
FPSCR : status (comparison results & exception flags) and control
(set vector length/stride, rounding mode, traps, etc) bits
FPSID : read-only register IDing VFP architecture
FPEXC : contains a few bits for system-level status & control
17. FPSCR Information
Status bits:
31 : N: 1 if comparison produced a less than result
30 : Z: 1 if comparison produced an equal result
29 : C: 1 if comparison is equal, greater than, or unordered
28 : V: 1 if comparison produced an unordered result (NaN)
Control bits:
24 : FZ: 0: IEEE compliant, 1: flush-to-zero enabled
23:22 : Set IEEE rounding mode
18:16 : Set vector length: set to 000 for scalar operations
12:8 : trap enable bits for fp exceptions given below
4 : IXC - Inexact Exceptions (non-zero rounding occurred)
3 : UFC - Underflow Exceptions (result too small in magnitude)
2 : OFC - Overflow Exceptions (result too large in magnitude)
1 : DZC - Division by Zero
0 : IOC - Invalid Operation (NaN result)
⇒ Use FMRX & FMXR to manipulate (next slide)
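The bit positions above can be captured as C masks for use on a value read from FPSCR. The macro and function names are made up for this sketch; fpscr_prepare mirrors the kind of masking a routine might do before a computation, clearing the four exception bits 3:0 and the flush-to-zero bit.

```c
#include <stdint.h>

/* Status (comparison) bits */
#define FPSCR_N      (1u << 31)
#define FPSCR_Z      (1u << 30)
#define FPSCR_C      (1u << 29)
#define FPSCR_V      (1u << 28)
/* Control bits */
#define FPSCR_FZ     (1u << 24)
#define FPSCR_RMODE(x)  (((x) >> 22) & 0x3u)   /* bits 23:22 */
#define FPSCR_LEN(x)    (((x) >> 16) & 0x7u)   /* bits 18:16 */
/* Exception bits */
#define FPSCR_IXC    (1u << 4)
#define FPSCR_UFC    (1u << 3)
#define FPSCR_OFC    (1u << 2)
#define FPSCR_DZC    (1u << 1)
#define FPSCR_IOC    (1u << 0)

/* Clear IOC/DZC/OFC/UFC and turn off flush-to-zero; other bits unchanged. */
uint32_t fpscr_prepare(uint32_t fpscr)
{
    fpscr &= ~(FPSCR_UFC | FPSCR_OFC | FPSCR_DZC | FPSCR_IOC);
    fpscr &= ~FPSCR_FZ;
    return fpscr;
}

/* Nonzero if any of the four exception bits 3:0 is set. */
int fpscr_had_exception(uint32_t fpscr)
{
    return (fpscr & 0xFu) != 0;
}
```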
18. ARM FP ld/st/move
Mnem       Operands           Action
fld        fd, [ra]           fd = *(ra)
fld        fd, [ra, ±imm8*4]  fd = *(ra ± imm8*4)
fst        fs, [ra]           *(ra) = fs
fst        fs, [ra, ±imm8*4]  *(ra ± imm8*4) = fs
fcpy       fd, fs             fd = fs
fabs       fd, fs             fd = abs(fs)
fneg       fd, fs             fd = -fs
fmrs       rd, fs             rd = fs (bit transfer, no fp-to-int conversion)
fmsr       fd, rs             fd = rs (bit transfer, no int-to-fp conversion)
fmd[l,h]r  fd, rs             xfer ireg to low or high half of double reg
fmrd[l,h]  rd, fs             xfer low or high half of double reg to ireg
fcvtds     dd, fs             convert float to double
fcvtsd     sd, ds             convert double to float
fmstat     none               move FPSCR’s N,Z,C,V flags to integer CC of same name
fmrx       rd, sysreg         move FPSID, FPSCR, or FPEXC to ireg rd
fmxr       sysreg, rs         move ireg rs to FPSID, FPSCR, or FPEXC
• imm8*4 is written #N, where N is a multiple of 4: fldd fd, [PTR,#16]
19. ARM FP LD/ST multiple
FP LD/ST multiple can load contiguous registers only
• fldm[IA,DB]_ ra[!], {reg list}
• fstm[IA,DB]_ ra[!], {reg list}
register list is a contiguous increasing set of registers, specified by ranges:
• {s5-s11} // ld/st s5,s6,s7,s8,s9,s10,s11
• For each sreg in list, size(RL) = 4*nreg
• For each dreg in list, size(RL) = 8*nreg
• Low reg # is stored to low part of memory
The suffixes indicate how to form the address and what to store if ! is
set. EA will be the address accessed, while UA will be the address that is
written to ra if it has the ! suffix:
suff  Meaning           Addresses
IA    Increment After   EA = ra; UA = ra+size(RL)
DB    Decrement Before  EA = UA = ra-size(RL)
20. ARM FP LD/ST multiple w/o type
FP LD/ST multiple X can load contiguous registers regardless of type
(useful for prologue/epilogue):
• fldm[IA,DB]x ra[!], {reg list}
• fstm[IA,DB]x ra[!], {reg list}
register list is a contiguous increasing set of double registers, specified by ranges:
• {d3-d6} // ld/st d3-d6 regardless of int/float/double
• For each dreg in list, size(RL) = 8*nreg + 4 (the X form transfers one extra word)
• Low reg # is stored to low part of memory
The suffixes indicate how to form the address and what to store if ! is
set. EA will be the address accessed, while UA will be the address that is
written to ra if it has the ! suffix:
suff  Meaning           Addresses
IA    Increment After   EA = ra; UA = ra+size(RL)
DB    Decrement Before  EA = UA = ra-size(RL)
21. Common ARM Floating Point Computation Instructions
Mnemonic  Operands      Action
fmac      fd, fs1, fs2  fd += fs1*fs2
fnmac     fd, fs1, fs2  fd -= fs1*fs2
fmsc      fd, fs1, fs2  fd = fs1*fs2 - fd
fnmsc     fd, fs1, fs2  fd = -fs1*fs2 - fd
fmul      fd, fs1, fs2  fd = fs1*fs2
fnmul     fd, fs1, fs2  fd = -(fs1*fs2)
fdiv      fd, fs1, fs2  fd = fs1/fs2
fadd      fd, fs1, fs2  fd = fs1 + fs2
fsub      fd, fs1, fs2  fd = fs1 - fs2
fsqrt     fd, fs        fd = sqrt(fs)
22. ARM Floating Point Comparison Instructions
Mnem    Ops     Explanation
fcmp    fd, fs  compare fd and fs
fcmpe   fd, fs  fcmp; raise invalid op exception on NaN
fcmpz   fd      compare against 0 (fs=0)
fcmpez  fd      fcmpz; raise invalid op exception on NaN
FPSCR bit:  31  30  29  28
fcmp        N   Z   C   V
fd > fs     0   0   1   0
fd < fs     1   0   0   0
fd = fs     0   1   1   0
is(NaN)     0   0   1   1
• Use fmstat to move from FPSCR to iCC for predication
• Can fmrx rd, FPSCR to get the flags into an ireg for bit level operations
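The comparison outcomes can be modeled in C, following the FPSCR status-bit definitions (C is set for equal, greater than, or unordered; V only for unordered). The struct and function names are illustrative, and NaN is detected with the x != x test so no math library is needed.

```c
struct fpflags { unsigned n, z, c, v; };

/* Flags that a compare of a against b would leave in FPSCR bits 31:28. */
struct fpflags fcmp_flags(double a, double b)
{
    struct fpflags f = {0, 0, 0, 0};

    if (a != a || b != b) {     /* at least one NaN: unordered */
        f.c = 1;
        f.v = 1;
    } else if (a == b) {
        f.z = 1;
        f.c = 1;
    } else if (a < b) {
        f.n = 1;
    } else {                    /* a > b */
        f.c = 1;
    }
    return f;
}
```

Once fmstat has copied these bits to the integer condition codes, a MI-predicated instruction executes exactly when a < b, since N is set only in that case.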
23. Simple ZIAMAX in ARM Assembly
#define x0      d0
#define x1      d1
#define maxval  d2
#define sum     d3
#define N       r0
#define X       r1
#define maxX    r2
#define XX      r3
/*
 *                  r0             r1
 * int ATL_UIAMAX(int N, const TYPE *X,
 *                const int incX)
 */
#include "atlas_asm.h"
.text
.code 32
.globl ATL_UIAMAX
ATL_UIAMAX:
mov maxX, X
mov XX, X
fldmIAd X!, {x0,x1}
/* load real&imag, update X ptr */
fabsd x0, x0
fabsd x1, x1
faddd maxval, x0, x1
subs N, N, #1
bEQ DONE
/*
 * for (maxval=0.0, i=0; i < N; i++)
 * {
 *    sum = abs(x[2*i]) + abs(x[2*i+1]);
 *    if (sum > maxval) { maxval=sum; iret=i; }
 * }
 * return(iret);
 */
LOOP:
fldmIAd X!, {x0,x1}
fabsd x0, x0
fabsd x1, x1
faddd sum, x0, x1
pld [X,#168]            /* prefetch */
fcmpd maxval, sum       /* N=1 iff maxval < sum */
fmstat                  /* set iCC */
fcpydMI maxval, sum
subMI maxX, X, #16
subs N, N, #1
bNE LOOP
DONE:
sub r0, maxX, XX
mov r0, r0, LSR #4
bx lr
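For comparison, here is a C reference for what the assembly loop computes. It is a sketch under assumptions: the function and helper names are hypothetical, the data is double-precision complex stored as (real, imag) pairs, N is at least 1, and the stride is 1 (the assembly never reads incX).

```c
/* Absolute value without the math library. */
static double dabs(double v) { return v < 0.0 ? -v : v; }

/* Index of the first complex element with the largest |re| + |im|. */
int iamax_ref(int n, const double *x)
{
    int i, iret = 0;
    double maxval = dabs(x[0]) + dabs(x[1]);   /* element 0 seeds the max */

    for (i = 1; i < n; i++) {
        double sum = dabs(x[2 * i]) + dabs(x[2 * i + 1]);
        if (sum > maxval) {     /* strict >, so ties keep the earlier index */
            maxval = sum;
            iret = i;
        }
    }
    return iret;
}
```

The assembly tracks a pointer (maxX) instead of an index and converts it at the end: each complex double occupies 16 bytes, so shifting the pointer difference right by 4 gives the index.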
24. Safe NRM2 in ARM assembly pt I
#define N      r0
#define pX     r1
#define fpsav  r2
#define N0     r3
#define XX     r12
#ifdef SREAL
#define sx0    s0
#define ssum0  s1
#define zero   s2
#define scal   s3
#define SO 4
#else
#define sx0    d0
#define ssum0  d1
#define scal   d2
#define zero   d3
#define fdivs fdivd
#define flds fldd
#define fmacs fmacd
#define fabss fabsd
#define fcmps fcmpd
#define fcpysMI fcpydMI
#define fldsNE flddNE
#define fsqrts fsqrtd
#define fmuls fmuld
#define fcpys fcpyd
#define SO 8
#endif
/*
 *                  r0             r1
 * TYPE ATL_UNRM2(int N, const TYPE *X,
 *                const int incX)
 *                          r2
 */
.text
.code 32
.globl ATL_UNRM2
ATL_UNRM2:
fmrx fpsav, FPSCR       /* save original FPSCR */
mvn N0, #0xF            /* N0 = 0xFFFFFFF0 */
and N0, N0, fpsav       /* zero exception bits */
bic N0, N0, #(1<<24)    /* turn off flush-to-zero mode */
fmxr FPSCR, N0
mov N0, #0
#ifdef DREAL
fmdlr zero, N0
fmdhr zero, N0
#else
fmsr zero, N0
#endif
fcpys ssum0, zero
fcpys scal, zero
mov N0, N
mov XX, pX
25. Safe NRM2 in ARM assembly pt II
LOOP1:
flds sx0, [pX]
fmacs ssum0, sx0, sx0
fabss sx0, sx0
fcmps scal, sx0
pld [pX,#96]
fmstat
fcpysMI scal, sx0
subs N, N, #1
add pX, pX, #SO
bNE LOOP1
DONE:
/*
* If over/underflow happened redo
*/
fmrx r1, FPSCR
tst r1, #0xF
fmxr FPSCR, fpsav /* restore FPSCR */
bNE SSQ
fsqrts ssum0, ssum0
#ifdef DREAL
fmrdl r0, ssum0
fmrdh r1, ssum0
#else
fmrs r0, ssum0
#endif
bx lr
SSQ:
mov pX, XX
mov N, N0
fcpys ssum0, zero
SSQLOOP:
flds sx0, [pX]
fabss sx0, sx0
fdivs sx0, sx0, scal
pld [pX,#96]
fmacs ssum0, sx0, sx0
subs N, N, #1
add pX, pX, #SO
bNE SSQLOOP
fsqrts ssum0, ssum0
fmuls ssum0, ssum0, scal
#ifdef DREAL
fmrdl r0, ssum0
fmrdh r1, ssum0
#else
fmrs r0, ssum0
#endif
bx lr