The microarchitecture of Intel, AMD and VIA CPUs

Nytro · January 28, 2012

An optimization guide for assembly programmers and compiler makers

By Agner Fog. Copenhagen University College of Engineering.

Contents
1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into µops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
3.2 Branch prediction in P1
3.3 Branch prediction in PMMX, PPro, P2, and P3
3.4 Branch prediction in P4 and P4E
3.5 Branch prediction in PM and Core2
3.6 Branch prediction in Intel Nehalem
3.7 Branch prediction in Intel Sandy Bridge
3.8 Branch prediction in Intel Atom
3.9 Branch prediction in VIA Nano
3.10 Branch prediction in AMD K8 and K10
3.11 Branch prediction in AMD Bobcat
3.12 Indirect jumps on older processors
3.13 Returns (all processors except P1)
3.14 Static prediction
3.15 Close jumps
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
5.1 The pipeline in PPro, P2 and P3
5.2 Instruction fetch
5.3 Instruction decoding
5.4 Register renaming
5.5 ROB read
5.6 Out of order execution
5.7 Retirement
5.8 Partial register stalls
5.9 Store forwarding stalls
5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
6.1 The pipeline in PM
6.2 The pipeline in Core Solo and Duo
6.3 Instruction fetch
6.4 Instruction decoding
6.5 Loop buffer
6.6 Micro-op fusion
6.7 Stack engine
6.8 Register renaming
6.9 Register read stalls
6.10 Execution units
6.11 Execution units that are connected to both port 0 and 1
6.12 Retirement
6.13 Partial register access
6.14 Store forwarding stalls
6.15 Bottlenecks in PM
7 Core 2 and Nehalem pipeline
7.1 Pipeline
7.2 Instruction fetch and predecoding
7.3 Instruction decoding
7.4 Micro-op fusion
7.5 Macro-op fusion
7.6 Stack engine
7.7 Register renaming
7.8 Register read stalls
7.9 Execution units
7.10 Retirement
7.11 Partial register access
7.12 Store forwarding stalls
7.13 Cache and memory access
7.14 Breaking dependency chains
7.15 Multithreading in Nehalem
7.16 Bottlenecks in Core2 and Nehalem
8 Sandy Bridge pipeline
8.1 Pipeline
8.2 Instruction fetch and decoding
8.3 µop cache
8.4 Loopback buffer
8.5 Micro-op fusion
8.6 Macro-op fusion
8.7 Stack engine
8.8 Register allocation and renaming
8.9 Register read stalls
8.10 Execution units
8.11 Partial register access
8.12 Transitions between VEX and non-VEX modes
8.13 Cache and memory access
8.14 Store forwarding stalls
8.15 Multithreading
8.16 Bottlenecks in Sandy Bridge
9 Pentium 4 (NetBurst) pipeline
9.1 Data cache
9.2 Trace cache
9.3 Instruction decoding
9.4 Execution units
9.5 Do the floating point and MMX units run at half speed?
9.6 Transfer of data between execution units
9.7 Retirement
9.8 Partial registers and partial flags
9.9 Store forwarding stalls
9.10 Memory intermediates in dependency chains
9.11 Breaking dependency chains
9.12 Choosing the optimal instructions
9.13 Bottlenecks in P4 and P4E
10 Intel Atom pipeline
10.1 Instruction fetch
10.2 Instruction decoding
10.3 Execution units
10.4 Instruction pairing
10.5 X87 floating point instructions
10.6 Instruction latencies
10.7 Memory access
10.8 Branches and loops
10.9 Multithreading
10.10 Bottlenecks in Atom
11 VIA Nano pipeline
11.1 Performance monitor counters
11.2 Instruction fetch
11.3 Instruction decoding
11.4 Instruction fusion
11.5 Out of order system
11.6 Execution ports
11.7 Latencies between execution units
11.8 Partial registers and partial flags
11.9 Breaking dependence
11.10 Memory access
11.11 Branches and loops
11.12 VIA specific instructions
11.13 Bottlenecks in Nano
12 AMD K8 and K10 pipeline
12.1 The pipeline in AMD processors
12.2 Instruction fetch
12.3 Predecoding and instruction length decoding
12.4 Single, double and vector path instructions
12.5 Stack engine
12.6 Integer execution pipes
12.7 Floating point execution pipes
12.8 Mixing instructions with different latency
12.9 64 bit versus 128 bit instructions
12.10 Data delay between differently typed instructions
12.11 Partial register access
12.12 Partial flag access
12.13 Store forwarding stalls
12.14 Loops
12.15 Cache
12.16 Bottlenecks in AMD
13 AMD Bobcat pipeline
13.1 The pipeline in AMD Bobcat
13.2 Instruction fetch
13.3 Instruction decoding
13.4 Single, double and complex instructions
13.5 Integer execution pipes
13.6 Floating point execution pipes
13.7 Mixing instructions with different latency
13.8 Dependency-breaking instructions
13.9 Data delay between differently typed instructions
13.10 Partial register access
13.11 Cache
13.12 Store forwarding stalls
13.13 Bottlenecks in Bobcat
13.14 Literature:
14 Comparison of microarchitectures
14.1 The AMD kernel
14.2 The Pentium 4 kernel
14.3 The Pentium M kernel
14.4 Intel Core 2 and Nehalem microarchitecture
14.5 Intel Sandy Bridge microarchitecture
15 Comparison of low power microarchitectures
15.1 Intel Atom microarchitecture
15.2 VIA Nano microarchitecture
15.3 AMD Bobcat microarchitecture
15.4 Conclusion
15.5 Future trends
16 Literature
17 Copyright notice

Download:

http://agner.org/optimize/microarchitecture.pdf
