Nytro
Posted January 28, 2012

Optimizing subroutines in assembly language
An optimization guide for x86 platforms

By Agner Fog. Copenhagen University College of Engineering.
Copyright © 1996 - 2011. Last updated 2011-06-08.

Contents

1 Introduction
  1.1 Reasons for using assembly code
  1.2 Reasons for not using assembly code
  1.3 Microprocessors covered by this manual
  1.4 Operating systems covered by this manual
2 Before you start
  2.1 Things to decide before you start programming
  2.2 Make a test strategy
  2.3 Common coding pitfalls
3 The basics of assembly coding
  3.1 Assemblers available
  3.2 Register set and basic instructions
  3.3 Addressing modes
  3.4 Instruction code format
  3.5 Instruction prefixes
4 ABI standards
  4.1 Register usage
  4.2 Data storage
  4.3 Function calling conventions
  4.4 Name mangling and name decoration
  4.5 Function examples
5 Using intrinsic functions in C++
  5.1 Using intrinsic functions for system code
  5.2 Using intrinsic functions for instructions not available in standard C++
  5.3 Using intrinsic functions for vector operations
  5.4 Availability of intrinsic functions
6 Using inline assembly in C++
  6.1 MASM style inline assembly
  6.2 Gnu style inline assembly
7 Using an assembler
  7.1 Static link libraries
  7.2 Dynamic link libraries
  7.3 Libraries in source code form
  7.4 Making classes in assembly
  7.5 Thread-safe functions
  7.6 Makefiles
8 Making function libraries compatible with multiple compilers and platforms
  8.1 Supporting multiple name mangling schemes
  8.2 Supporting multiple calling conventions in 32 bit mode
  8.3 Supporting multiple calling conventions in 64 bit mode
  8.4 Supporting different object file formats
  8.5 Supporting other high level languages
9 Optimizing for speed
  9.1 Identify the most critical parts of your code
  9.2 Out of order execution
  9.3 Instruction fetch, decoding and retirement
  9.4 Instruction latency and throughput
  9.5 Break dependency chains
  9.6 Jumps and calls
10 Optimizing for size
  10.1 Choosing shorter instructions
  10.2 Using shorter constants and addresses
  10.3 Reusing constants
  10.4 Constants in 64-bit mode
  10.5 Addresses and pointers in 64-bit mode
  10.6 Making instructions longer for the sake of alignment
  10.7 Using multi-byte NOPs for alignment
11 Optimizing memory access
  11.1 How caching works
  11.2 Trace cache
  11.3 µop cache
  11.4 Alignment of data
  11.5 Alignment of code
  11.6 Organizing data for improved caching
  11.7 Organizing code for improved caching
  11.8 Cache control instructions
12 Loops
  12.1 Minimize loop overhead
  12.2 Induction variables
  12.3 Move loop-invariant code
  12.4 Find the bottlenecks
  12.5 Instruction fetch, decoding and retirement in a loop
  12.6 Distribute µops evenly between execution units
  12.7 An example of analysis for bottlenecks on PM
  12.8 Same example on Core2
  12.9 Same example on Sandy Bridge
  12.10 Loop unrolling
  12.11 Optimize caching
  12.12 Parallelization
  12.13 Analyzing dependences
  12.14 Loops on processors without out-of-order execution
  12.15 Macro loops
13 Vector programming
  13.1 Conditional moves in SIMD registers
  13.2 Using vector instructions with other types of data than they are intended for
  13.3 Shuffling data
  13.4 Generating constants
  13.5 Accessing unaligned data
  13.6 Using AVX instruction set and YMM registers
  13.7 Vector operations in general purpose registers
14 Multithreading
  14.1 Hyperthreading
15 CPU dispatching
  15.1 Checking for operating system support for XMM and YMM registers
16 Problematic Instructions
  16.1 LEA instruction (all processors)
  16.2 INC and DEC
  16.3 XCHG (all processors)
  16.4 Shifts and rotates (P4)
  16.5 Rotates through carry (all processors)
  16.6 Bit test (all processors)
  16.7 LAHF and SAHF (all processors)
  16.8 Integer multiplication (all processors)
  16.9 Division (all processors)
  16.10 String instructions (all processors)
  16.11 WAIT instruction (all processors)
  16.12 FCOM + FSTSW AX (all processors)
  16.13 FPREM (all processors)
  16.14 FRNDINT (all processors)
  16.15 FSCALE and exponential function (all processors)
  16.16 FPTAN (all processors)
  16.17 FSQRT (SSE processors)
  16.18 FLDCW (Most Intel processors)
17 Special topics
  17.1 XMM versus floating point registers
  17.2 MMX versus XMM registers
  17.3 XMM versus YMM registers
  17.4 Freeing floating point registers (all processors)
  17.5 Transitions between floating point and MMX instructions
  17.6 Converting from floating point to integer (All processors)
  17.7 Using integer instructions for floating point operations
  17.8 Using floating point instructions for integer operations
  17.9 Moving blocks of data (All processors)
  17.10 Self-modifying code (All processors)
18 Measuring performance
  18.1 Testing speed
  18.2 The pitfalls of unit-testing
19 Literature
20 Copyright notice

Download:
http://agner.org/optimize/optimizing_assembly.pdf