Nytro

The microarchitecture of Intel, AMD and VIA CPUs

An optimization guide for assembly programmers and compiler makers

By Agner Fog. Copenhagen University College of Engineering.

Copyright © 1996 - 2011. Last updated 2011-06-08.

Contents
1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into µops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
3.2 Branch prediction in P1
3.3 Branch prediction in PMMX, PPro, P2, and P3
3.4 Branch prediction in P4 and P4E
3.5 Branch prediction in PM and Core2
3.6 Branch prediction in Intel Nehalem
3.7 Branch prediction in Intel Sandy Bridge
3.8 Branch prediction in Intel Atom
3.9 Branch prediction in VIA Nano
3.10 Branch prediction in AMD K8 and K10
3.11 Branch prediction in AMD Bobcat
3.12 Indirect jumps on older processors
3.13 Returns (all processors except P1)
3.14 Static prediction
3.15 Close jumps
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
5.1 The pipeline in PPro, P2 and P3
5.2 Instruction fetch
5.3 Instruction decoding
5.4 Register renaming
5.5 ROB read
5.6 Out of order execution
5.7 Retirement
5.8 Partial register stalls
5.9 Store forwarding stalls
5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
6.1 The pipeline in PM
6.2 The pipeline in Core Solo and Duo
6.3 Instruction fetch
6.4 Instruction decoding
6.5 Loop buffer
6.6 Micro-op fusion
6.7 Stack engine
6.8 Register renaming
6.9 Register read stalls
6.10 Execution units
6.11 Execution units that are connected to both port 0 and 1
6.12 Retirement
6.13 Partial register access
6.14 Store forwarding stalls
6.15 Bottlenecks in PM
7 Core 2 and Nehalem pipeline
7.1 Pipeline
7.2 Instruction fetch and predecoding
7.3 Instruction decoding
7.4 Micro-op fusion
7.5 Macro-op fusion
7.6 Stack engine
7.7 Register renaming
7.8 Register read stalls
7.9 Execution units
7.10 Retirement
7.11 Partial register access
7.12 Store forwarding stalls
7.13 Cache and memory access
7.14 Breaking dependency chains
7.15 Multithreading in Nehalem
7.16 Bottlenecks in Core2 and Nehalem
8 Sandy Bridge pipeline
8.1 Pipeline
8.2 Instruction fetch and decoding
8.3 µop cache
8.4 Loopback buffer
8.5 Micro-op fusion
8.6 Macro-op fusion
8.7 Stack engine
8.8 Register allocation and renaming
8.9 Register read stalls
8.10 Execution units
8.11 Partial register access
8.12 Transitions between VEX and non-VEX modes
8.13 Cache and memory access
8.14 Store forwarding stalls
8.15 Multithreading
8.16 Bottlenecks in Sandy Bridge
9 Pentium 4 (NetBurst) pipeline
9.1 Data cache
9.2 Trace cache
9.3 Instruction decoding
9.4 Execution units
9.5 Do the floating point and MMX units run at half speed?
9.6 Transfer of data between execution units
9.7 Retirement
9.8 Partial registers and partial flags
9.9 Store forwarding stalls
9.10 Memory intermediates in dependency chains
9.11 Breaking dependency chains
9.12 Choosing the optimal instructions
9.13 Bottlenecks in P4 and P4E
10 Intel Atom pipeline
10.1 Instruction fetch
10.2 Instruction decoding
10.3 Execution units
10.4 Instruction pairing
10.5 X87 floating point instructions
10.6 Instruction latencies
10.7 Memory access
10.8 Branches and loops
10.9 Multithreading
10.10 Bottlenecks in Atom
11 VIA Nano pipeline
11.1 Performance monitor counters
11.2 Instruction fetch
11.3 Instruction decoding
11.4 Instruction fusion
11.5 Out of order system
11.6 Execution ports
11.7 Latencies between execution units
11.8 Partial registers and partial flags
11.9 Breaking dependence
11.10 Memory access
11.11 Branches and loops
11.12 VIA specific instructions
11.13 Bottlenecks in Nano
12 AMD K8 and K10 pipeline
12.1 The pipeline in AMD processors
12.2 Instruction fetch
12.3 Predecoding and instruction length decoding
12.4 Single, double and vector path instructions
12.5 Stack engine
12.6 Integer execution pipes
12.7 Floating point execution pipes
12.8 Mixing instructions with different latency
12.9 64 bit versus 128 bit instructions
12.10 Data delay between differently typed instructions
12.11 Partial register access
12.12 Partial flag access
12.13 Store forwarding stalls
12.14 Loops
12.15 Cache
12.16 Bottlenecks in AMD
13 AMD Bobcat pipeline
13.1 The pipeline in AMD Bobcat
13.2 Instruction fetch
13.3 Instruction decoding
13.4 Single, double and complex instructions
13.5 Integer execution pipes
13.6 Floating point execution pipes
13.7 Mixing instructions with different latency
13.8 Dependency-breaking instructions
13.9 Data delay between differently typed instructions
13.10 Partial register access
13.11 Cache
13.12 Store forwarding stalls
13.13 Bottlenecks in Bobcat
13.14 Literature:
14 Comparison of microarchitectures
14.1 The AMD kernel
14.2 The Pentium 4 kernel
14.3 The Pentium M kernel
14.4 Intel Core 2 and Nehalem microarchitecture
14.5 Intel Sandy Bridge microarchitecture
15 Comparison of low power microarchitectures
15.1 Intel Atom microarchitecture
15.2 VIA Nano microarchitecture
15.3 AMD Bobcat microarchitecture
15.4 Conclusion
15.5 Future trends
16 Literature
17 Copyright notice

Download:

http://agner.org/optimize/microarchitecture.pdf
