Nytro, posted January 28, 2012

The microarchitecture of Intel, AMD and VIA CPUs
An optimization guide for assembly programmers and compiler makers
By Agner Fog. Copenhagen University College of Engineering.
Copyright © 1996 - 2011. Last updated 2011-06-08.

Contents

1 Introduction
    1.1 About this manual
    1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
    2.1 Instructions are split into µops
    2.2 Register renaming
3 Branch prediction (all processors)
    3.1 Prediction methods for conditional jumps
    3.2 Branch prediction in P1
    3.3 Branch prediction in PMMX, PPro, P2, and P3
    3.4 Branch prediction in P4 and P4E
    3.5 Branch prediction in PM and Core2
    3.6 Branch prediction in Intel Nehalem
    3.7 Branch prediction in Intel Sandy Bridge
    3.8 Branch prediction in Intel Atom
    3.9 Branch prediction in VIA Nano
    3.10 Branch prediction in AMD K8 and K10
    3.11 Branch prediction in AMD Bobcat
    3.12 Indirect jumps on older processors
    3.13 Returns (all processors except P1)
    3.14 Static prediction
    3.15 Close jumps
4 Pentium 1 and Pentium MMX pipeline
    4.1 Pairing integer instructions
    4.2 Address generation interlock
    4.3 Splitting complex instructions into simpler ones
    4.4 Prefixes
    4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
    5.1 The pipeline in PPro, P2 and P3
    5.2 Instruction fetch
    5.3 Instruction decoding
    5.4 Register renaming
    5.5 ROB read
    5.6 Out of order execution
    5.7 Retirement
    5.8 Partial register stalls
    5.9 Store forwarding stalls
    5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
    6.1 The pipeline in PM
    6.2 The pipeline in Core Solo and Duo
    6.3 Instruction fetch
    6.4 Instruction decoding
    6.5 Loop buffer
    6.6 Micro-op fusion
    6.7 Stack engine
    6.8 Register renaming
    6.9 Register read stalls
    6.10 Execution units
    6.11 Execution units that are connected to both port 0 and 1
    6.12 Retirement
    6.13 Partial register access
    6.14 Store forwarding stalls
    6.15 Bottlenecks in PM
7 Core 2 and Nehalem pipeline
    7.1 Pipeline
    7.2 Instruction fetch and predecoding
    7.3 Instruction decoding
    7.4 Micro-op fusion
    7.5 Macro-op fusion
    7.6 Stack engine
    7.7 Register renaming
    7.8 Register read stalls
    7.9 Execution units
    7.10 Retirement
    7.11 Partial register access
    7.12 Store forwarding stalls
    7.13 Cache and memory access
    7.14 Breaking dependency chains
    7.15 Multithreading in Nehalem
    7.16 Bottlenecks in Core2 and Nehalem
8 Sandy Bridge pipeline
    8.1 Pipeline
    8.2 Instruction fetch and decoding
    8.3 µop cache
    8.4 Loopback buffer
    8.5 Micro-op fusion
    8.6 Macro-op fusion
    8.7 Stack engine
    8.8 Register allocation and renaming
    8.9 Register read stalls
    8.10 Execution units
    8.11 Partial register access
    8.12 Transitions between VEX and non-VEX modes
    8.13 Cache and memory access
    8.14 Store forwarding stalls
    8.15 Multithreading
    8.16 Bottlenecks in Sandy Bridge
9 Pentium 4 (NetBurst) pipeline
    9.1 Data cache
    9.2 Trace cache
    9.3 Instruction decoding
    9.4 Execution units
    9.5 Do the floating point and MMX units run at half speed?
    9.6 Transfer of data between execution units
    9.7 Retirement
    9.8 Partial registers and partial flags
    9.9 Store forwarding stalls
    9.10 Memory intermediates in dependency chains
    9.11 Breaking dependency chains
    9.12 Choosing the optimal instructions
    9.13 Bottlenecks in P4 and P4E
10 Intel Atom pipeline
    10.1 Instruction fetch
    10.2 Instruction decoding
    10.3 Execution units
    10.4 Instruction pairing
    10.5 X87 floating point instructions
    10.6 Instruction latencies
    10.7 Memory access
    10.8 Branches and loops
    10.9 Multithreading
    10.10 Bottlenecks in Atom
11 VIA Nano pipeline
    11.1 Performance monitor counters
    11.2 Instruction fetch
    11.3 Instruction decoding
    11.4 Instruction fusion
    11.5 Out of order system
    11.6 Execution ports
    11.7 Latencies between execution units
    11.8 Partial registers and partial flags
    11.9 Breaking dependence
    11.10 Memory access
    11.11 Branches and loops
    11.12 VIA specific instructions
    11.13 Bottlenecks in Nano
12 AMD K8 and K10 pipeline
    12.1 The pipeline in AMD processors
    12.2 Instruction fetch
    12.3 Predecoding and instruction length decoding
    12.4 Single, double and vector path instructions
    12.5 Stack engine
    12.6 Integer execution pipes
    12.7 Floating point execution pipes
    12.8 Mixing instructions with different latency
    12.9 64 bit versus 128 bit instructions
    12.10 Data delay between differently typed instructions
    12.11 Partial register access
    12.12 Partial flag access
    12.13 Store forwarding stalls
    12.14 Loops
    12.15 Cache
    12.16 Bottlenecks in AMD
13 AMD Bobcat pipeline
    13.1 The pipeline in AMD Bobcat
    13.2 Instruction fetch
    13.3 Instruction decoding
    13.4 Single, double and complex instructions
    13.5 Integer execution pipes
    13.6 Floating point execution pipes
    13.7 Mixing instructions with different latency
    13.8 Dependency-breaking instructions
    13.9 Data delay between differently typed instructions
    13.10 Partial register access
    13.11 Cache
    13.12 Store forwarding stalls
    13.13 Bottlenecks in Bobcat
    13.14 Literature
14 Comparison of microarchitectures
    14.1 The AMD kernel
    14.2 The Pentium 4 kernel
    14.3 The Pentium M kernel
    14.4 Intel Core 2 and Nehalem microarchitecture
    14.5 Intel Sandy Bridge microarchitecture
15 Comparison of low power microarchitectures
    15.1 Intel Atom microarchitecture
    15.2 VIA Nano microarchitecture
    15.3 AMD Bobcat microarchitecture
    15.4 Conclusion
    15.5 Future trends
16 Literature
17 Copyright notice

Download: http://agner.org/optimize/microarchitecture.pdf