


## Today

Architecture/Microarchitecture: What is the difference?

In detail: Intel Skylake

Crucial microarchitectural parameters

Peak performance

**Operational intensity** 

Brief: Apple M1 processor




**MMX:** Multimedia extension

**SSE:** Streaming SIMD extension

**AVX:** Advanced vector extensions

| Intel x86 | Processors (subset) |
|-----------|---------------------|
| x86-16    | 8086 (1978)<br>286  |
| x86-32    | 386<br>486<br>Pentium |
| MMX       | Pentium MMX         |
| SSE       | Pentium III         |
| SSE2      | Pentium 4           |
| SSE3      | Pentium 4E          |
| x86-64    | Pentium 4F<br>Core 2 |
| SSE4      | Penryn<br>Core i3/5/7 |
| AVX       | Sandy Bridge        |
| AVX2      | Haswell             |
| AVX-512   | Skylake-X<br>Icelake |

Time runs from top to bottom.

Backward compatible: old binary code (≥ 8086) runs on newer processors.

New code to run on old processors? Depends on compiler flags.
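Which instruction set extensions end up in a binary is decided when the code is compiled. As a minimal sketch, assuming GCC or Clang on x86 (both predefine macros such as `__AVX2__` when flags like `-mavx2` or `-march=native` are passed), the following reports the widest vector extension the compiler was allowed to use. The function names `widest_simd_isa` and `doubles_per_vector` are illustrative and not part of any library.

```c
/* Widest x86 vector extension enabled at compile time.
   Assumes the GCC/Clang predefined macros; other compilers may differ. */
const char *widest_simd_isa(void) {
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}

/* Doubles per vector register for that extension (1 = scalar only). */
int doubles_per_vector(void) {
#if defined(__AVX512F__)
    return 8;
#elif defined(__AVX__)
    return 4;
#elif defined(__SSE2__)
    return 2;
#else
    return 1;
#endif
}
```

Compiling the same file once with `-mavx2` and once without changes what these functions return. This is why a binary built with newer flags may not run on an older processor.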



# FMA = Fused Multiply-Add

 $x = x + y \cdot z$ 

Done as one operation, i.e., involves only one rounding step

Better accuracy than sequence of mult and add

Natural pattern in many algorithms

```
/* matrix multiplication; A, B, C are n x n matrices of doubles */
void mmm(const double *A, const double *B, double *C, int n) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        C[i*n+j] += A[i*n+k]*B[k*n+j];  /* x = x + y*z: one FMA */
}
```

Available on Intel processors only since Haswell (2013)



| Intel x86 | Processors (subset)   | Floating-point SIMD          |
|-----------|-----------------------|------------------------------|
| x86-16    | 8086<br>286           |                              |
| x86-32    | 386<br>486<br>Pentium |                              |
| MMX       | Pentium MMX           |                              |
| SSE       | Pentium III           | 4-way single                 |
| SSE2      | Pentium 4             | 2-way double                 |
| SSE3      | Pentium 4E            |                              |
| x86-64    | Pentium 4F<br>Core 2  |                              |
| SSE4      | Penryn<br>Core i3/5/7 |                              |
| AVX       | Sandy Bridge          | 8-way single, 4-way double   |
| AVX2      | Haswell               | FMAs                         |
| AVX-512   | Skylake-X<br>Icelake  | 16-way single, 8-way double  |

Time runs from top to bottom.

























## **Superscalar Processor**

Definition: A superscalar processor can issue and execute *multiple instructions in one cycle*. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.

Benefit: Superscalar processors can take advantage of the *instruction-level parallelism (ILP)* that many programs have.

Most CPUs since about 1998 are superscalar

Intel: since Pentium Pro

Simple embedded processors are usually not superscalar
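Instruction-level parallelism depends on how many independent instructions the code offers. The following sketch (function names are illustrative) sums an array in two ways: with one accumulator, where every add waits for the previous one, and with four accumulators, which gives a superscalar core four independent chains to overlap. For brevity it assumes the length is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the previous one, so the adds
   form a single sequential dependency chain. */
double sum_one_chain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent chains that can execute in
   parallel. Assumes n is a multiple of 4 to keep the sketch short. */
double sum_four_chains(const double *x, size_t n) {
    assert(n % 4 == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions perform the same number of adds. The second changes the order of the additions, so for general floating-point input the result can differ in the last bits.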



## **Execution Units and Ports (Skylake)**

| Port 0 | Port 1 | Port 2 | Port 3 | Port 4 | Port 5 | Port 6 | Port 7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| fp fma<br>fp mul<br>fp add<br>fp div<br>SIMD log<br>Int ALU | fp fma<br>fp mul<br>fp add<br>SIMD log<br>Int ALU | load<br>st addr | load<br>st addr | store | SIMD log<br>shuffle<br>fp mov<br>Int ALU | Int ALU | st addr |

Execution units:

- fp = floating point
- log = logic
- fp units do scalar *and* vector flops
- SIMD log: other, non-fp SIMD ops

| Execution unit (fp) | Latency<br>[cycles] | Throughput<br>[ops/cycle] | Gap<br>[cycles/issue] |
|---------------------|---------------------|---------------------------|-----------------------|
| fma                 | 4                   | 2                         | 0.5                   |
| mul                 | 4                   | 2                         | 0.5                   |
| add                 | 4                   | 2                         | 0.5                   |
| div (scalar)        | 14                  | 1/4                       | 4                     |
| div (4-way)         | 14                  | 1/8                       | 8                     |

- Every port can issue one instruction/cycle
- Gap = 1/throughput
- Intel reports the gap and calls it throughput!
- Same exec units for scalar and vector flops
- Same latency/throughput for scalar (one double) and AVX vector (four doubles) flops, except for div
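The relation gap = 1/throughput and the resulting peak performance can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: an FMA is counted as two flops per vector lane (one mult and one add), and the function names are made up for illustration.

```c
#include <assert.h>

/* Gap [cycles/issue] is the reciprocal of throughput [ops/cycle]. */
double gap_cycles(double throughput_ops_per_cycle) {
    assert(throughput_ops_per_cycle > 0.0);
    return 1.0 / throughput_ops_per_cycle;
}

/* Peak flops per cycle = instructions per cycle * vector lanes
   * flops per lane (assumption: 2 for an FMA, 1 for add or mul). */
double peak_flops_per_cycle(double ops_per_cycle, int lanes,
                            int flops_per_lane) {
    assert(lanes > 0 && flops_per_lane > 0);
    return ops_per_cycle * lanes * flops_per_lane;
}
```

With the Skylake numbers for fma (2 per cycle), `peak_flops_per_cycle(2.0, 4, 2)` gives 16 flops/cycle for AVX in double precision and `peak_flops_per_cycle(2.0, 1, 2)` gives 4 flops/cycle for scalar code. `gap_cycles(0.25)` reproduces the gap of 4 cycles for scalar div.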





How many cycles are at least required (no vector ops)? On Skylake, ports 0 and 1 each provide fp fma, fp mul, and fp add units, so at most two of these instructions can issue per cycle.

- A function with n adds and n mults in the C code: n/2 (the compiler can fuse them into n FMAs)
- A function with n add and n mult instructions in the assembly code: n (2n instructions, two per cycle)
- A function with n adds in the C code: n/2 (two fp add units)
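These lower bounds follow from dividing the instruction count by the throughput. A minimal sketch, with a hypothetical helper `min_cycles` and the Skylake throughput of 2 per cycle for fma, mul, and add:

```c
#include <assert.h>

/* Throughput lower bound: n_instr instructions of a kind that can issue
   `throughput` times per cycle need at least n_instr / throughput cycles.
   Latency and dependencies are ignored, so real code may need more. */
double min_cycles(long n_instr, double throughput) {
    assert(n_instr >= 0 && throughput > 0.0);
    return (double)n_instr / throughput;
}
```

For n = 1000: n adds and n mults fused into n FMAs give `min_cycles(1000, 2.0)` = 500 cycles, 2n separate add and mult instructions give `min_cycles(2000, 2.0)` = 1000 cycles, and n adds alone give `min_cycles(1000, 2.0)` = 500 cycles.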







### **Firestorm Microarchitecture**

#### Integer ports:

- 1: alu + flags + branch + addr + msr/mrs nzcv + mrs
- 2: alu + flags + branch + addr + msr/mrs nzcv + ptrauth
- 3: alu + flags + mov-from-simd/fp?
- 4: alu + mov-from-simd/fp?
- 5: alu + mul + div
- 6: alu + mul + madd + crc + bfm/extr

#### Load and store ports:

- 7: store + amx
- 8: load/store + amx
- 9: load
- 10: load

#### FP/SIMD ports:

- 11: fp/simd
- 12: fp/simd
- 13: fp/simd + fcsel + to-gpr
- 14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + ...

| Instruction | Latency<br>[cycles] | Throughput<br>[ops/cycle] | Gap<br>[cycles/issue] |
|-------------|---------------------|---------------------------|-----------------------|
| add         | 3                   | 4                         | 0.25                  |
| mul         | 4                   | 4                         | 0.25                  |
| div         | 10                  | 1                         | 1                     |
| load        |                     | 3                         | 0.33                  |
| store       |                     | 2                         | 0.5                   |

Latency and throughput of FP instructions in double precision. The numbers are the same for scalar and vector instructions.

This information is based on black-box reverse engineering: https://dougallj.github.io/applecpu/firestorm.html
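One way to read such a latency/throughput table is to ask how many independent operations must be in flight to keep all units busy. A common rule of thumb, which is an assumption added here rather than a statement from the slides, is latency times throughput. The helper names below are illustrative.

```c
#include <assert.h>

/* Independent operations needed in flight to reach full throughput:
   latency [cycles] * throughput [ops/cycle]. */
int ops_in_flight(int latency_cycles, int throughput_ops_per_cycle) {
    assert(latency_cycles > 0 && throughput_ops_per_cycle > 0);
    return latency_cycles * throughput_ops_per_cycle;
}

/* Lower bound on cycles for n operations that form one dependency
   chain: each must wait for the full latency of its predecessor. */
long chain_cycles(long n, int latency_cycles) {
    assert(n >= 0 && latency_cycles > 0);
    return n * latency_cycles;
}
```

With the Firestorm numbers from the table, `ops_in_flight(3, 4)` gives 12 independent adds and `ops_in_flight(4, 4)` gives 16 independent mults. A single chain of 1000 dependent adds needs at least `chain_cycles(1000, 3)` = 3000 cycles, compared with 250 cycles if throughput were the only limit.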


### **Icestorm Microarchitecture**

#### Integer ports:

- 1: alu + br + mrs
- 2: alu + br + div + ptrauth
- 3: alu + mul + bfm + crc

#### Load and store ports:

- 4: load/store + amx
- 5: load

#### FP/SIMD ports:

- 6: fp/simd
- 7: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + ...

| Instruction  | Latency<br>[cycles] | Throughput<br>[ops/cycle] | Gap<br>[cycles/issue] |
|--------------|---------------------|---------------------------|-----------------------|
| add          | 3                   | 2                         | 0.5                   |
| mul          | 4                   | 2                         | 0.5                   |
| div (scalar) | 10                  | 1                         | 1                     |
| div (2-way)  | 11                  | 0.5                       | 2                     |
| load         |                     | 2                         | 0.5                   |
| store        |                     | 1                         | 1                     |

Latency and throughput of FP instructions in double precision. The numbers are the same for scalar and vector instructions, except for div, which is listed separately.

## Apple M2

Launched in June 2022

5 nm

Firestorm/Icestorm $\rightarrow$ Avalanche/Blizzard

https://en.wikipedia.org/wiki/Apple_M2

## Apple M3

Possible launch late 2023

3 nm

1 nm is still under research/development but seems possible.

A typical atom measures 0.1–1 nm.



