Nytro

Optimizing subroutines in assembly language

An optimization guide for x86 platforms

By Agner Fog. Copenhagen University College of Engineering.

Copyright © 1996 - 2011. Last updated 2011-06-08.

Contents
1 Introduction
1.1 Reasons for using assembly code
1.2 Reasons for not using assembly code
1.3 Microprocessors covered by this manual
1.4 Operating systems covered by this manual
2 Before you start
2.1 Things to decide before you start programming
2.2 Make a test strategy
2.3 Common coding pitfalls
3 The basics of assembly coding
3.1 Assemblers available
3.2 Register set and basic instructions
3.3 Addressing modes
3.4 Instruction code format
3.5 Instruction prefixes
4 ABI standards
4.1 Register usage
4.2 Data storage
4.3 Function calling conventions
4.4 Name mangling and name decoration
4.5 Function examples
5 Using intrinsic functions in C++
5.1 Using intrinsic functions for system code
5.2 Using intrinsic functions for instructions not available in standard C++
5.3 Using intrinsic functions for vector operations
5.4 Availability of intrinsic functions
6 Using inline assembly in C++
6.1 MASM style inline assembly
6.2 Gnu style inline assembly
7 Using an assembler
7.1 Static link libraries
7.2 Dynamic link libraries
7.3 Libraries in source code form
7.4 Making classes in assembly
7.5 Thread-safe functions
7.6 Makefiles
8 Making function libraries compatible with multiple compilers and platforms
8.1 Supporting multiple name mangling schemes
8.2 Supporting multiple calling conventions in 32 bit mode
8.3 Supporting multiple calling conventions in 64 bit mode
8.4 Supporting different object file formats
8.5 Supporting other high level languages
9 Optimizing for speed
9.1 Identify the most critical parts of your code
9.2 Out of order execution
9.3 Instruction fetch, decoding and retirement
9.4 Instruction latency and throughput
9.5 Break dependency chains
9.6 Jumps and calls
10 Optimizing for size
10.1 Choosing shorter instructions
10.2 Using shorter constants and addresses
10.3 Reusing constants
10.4 Constants in 64-bit mode
10.5 Addresses and pointers in 64-bit mode
10.6 Making instructions longer for the sake of alignment
10.7 Using multi-byte NOPs for alignment
11 Optimizing memory access
11.1 How caching works
11.2 Trace cache
11.3 µop cache
11.4 Alignment of data
11.5 Alignment of code
11.6 Organizing data for improved caching
11.7 Organizing code for improved caching
11.8 Cache control instructions
12 Loops
12.1 Minimize loop overhead
12.2 Induction variables
12.3 Move loop-invariant code
12.4 Find the bottlenecks
12.5 Instruction fetch, decoding and retirement in a loop
12.6 Distribute µops evenly between execution units
12.7 An example of analysis for bottlenecks on PM
12.8 Same example on Core2
12.9 Same example on Sandy Bridge
12.10 Loop unrolling
12.11 Optimize caching
12.12 Parallelization
12.13 Analyzing dependences
12.14 Loops on processors without out-of-order execution
12.15 Macro loops
13 Vector programming
13.1 Conditional moves in SIMD registers
13.2 Using vector instructions with other types of data than they are intended for
13.3 Shuffling data
13.4 Generating constants
13.5 Accessing unaligned data
13.6 Using AVX instruction set and YMM registers
13.7 Vector operations in general purpose registers
14 Multithreading
14.1 Hyperthreading
15 CPU dispatching
15.1 Checking for operating system support for XMM and YMM registers
16 Problematic Instructions
16.1 LEA instruction (all processors)
16.2 INC and DEC
16.3 XCHG (all processors)
16.4 Shifts and rotates (P4)
16.5 Rotates through carry (all processors)
16.6 Bit test (all processors)
16.7 LAHF and SAHF (all processors)
16.8 Integer multiplication (all processors)
16.9 Division (all processors)
16.10 String instructions (all processors)
16.11 WAIT instruction (all processors)
16.12 FCOM + FSTSW AX (all processors)
16.13 FPREM (all processors)
16.14 FRNDINT (all processors)
16.15 FSCALE and exponential function (all processors)
16.16 FPTAN (all processors)
16.17 FSQRT (SSE processors)
16.18 FLDCW (Most Intel processors)
17 Special topics
17.1 XMM versus floating point registers
17.2 MMX versus XMM registers
17.3 XMM versus YMM registers
17.4 Freeing floating point registers (all processors)
17.5 Transitions between floating point and MMX instructions
17.6 Converting from floating point to integer (All processors)
17.7 Using integer instructions for floating point operations
17.8 Using floating point instructions for integer operations
17.9 Moving blocks of data (All processors)
17.10 Self-modifying code (All processors)
18 Measuring performance
18.1 Testing speed
18.2 The pitfalls of unit-testing
19 Literature
20 Copyright notice

Download:

http://agner.org/optimize/optimizing_assembly.pdf
