





- Architecture/Microarchitecture: What is the difference?
- In detail: Core 2/Core i7
- Crucial microarchitectural parameters
- Peak performance
- Operational intensity





























```
/* Matrix multiplication C = C + A*B; A, B, C are n x n matrices of doubles
   stored in row-major order, so element (i, j) is at index i*n + j. */
for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
         C[i*n+j] += A[i*n+k]*B[k*n+j];
```

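## Operational intensity

- Flops: W(n) = 2n<sup>3</sup> (one multiplication and one addition per innermost iteration)
- Memory/cache transfers (doubles): ≥ 3n<sup>2</sup> (just from reading A, B, and C once each)
- Reads (bytes): Q(n) ≥ 24n<sup>2</sup> (8 bytes per double)
- Operational intensity: I(n) = W(n)/Q(n) ≤ n/12 flops/byte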
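
The counts above can be checked with a small sketch. The helper names (`mmm_flops`, `mmm_min_bytes`, `mmm_intensity_bound`, `mmm_count_flops`) are illustrative and not part of the original material, and the code assumes a double occupies 8 bytes. The last function runs the same triple loop with a counter, so the measured flop count can be compared against W(n) = 2n<sup>3</sup>.

```c
#include <stddef.h>

/* W(n) = 2n^3: one multiply and one add per innermost iteration. */
double mmm_flops(size_t n) {
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Lower bound on Q(n) in bytes: A, B, and C are each read at least once,
   and a double is assumed to occupy 8 bytes. */
double mmm_min_bytes(size_t n) {
    double d = (double)n;
    return 3.0 * 8.0 * d * d;
}

/* Upper bound on I(n) = W(n)/Q(n) in flops/byte; simplifies to n/12. */
double mmm_intensity_bound(size_t n) {
    return mmm_flops(n) / mmm_min_bytes(n);
}

/* The triple loop, instrumented to count the flops it performs. */
unsigned long long mmm_count_flops(const double *A, const double *B,
                                   double *C, size_t n) {
    unsigned long long flops = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++) {
                C[i*n + j] += A[i*n + k] * B[k*n + j];
                flops += 2; /* one multiplication, one addition */
            }
    return flops;
}
```

For n = 1200, `mmm_intensity_bound(1200)` evaluates to 100 flops/byte. This is only an upper bound: it is reached only if each matrix element is transferred from memory exactly once.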













- <b>MMX:</b> Multimedia extension
- <b>SSE:</b> Streaming SIMD extension
- <b>AVX:</b> Advanced vector extensions

Time increases from top to bottom.

| Intel x86      | Processors        |
|----------------|-------------------|
| x86-16         | 8086              |
|                | 286               |
| x86-32         | 386               |
|                | 486               |
|                | Pentium           |
| MMX            | Pentium MMX       |
| SSE            | Pentium III       |
| SSE2           | Pentium 4         |
| SSE3           | Pentium 4E        |
| x86-64 / em64t | Pentium 4F        |
|                | Core 2 Duo        |
| SSE4           | Penryn            |
|                | Core i7 (Nehalem) |
| AVX            | Sandy Bridge      |
| AVX2           | Haswell           |



| Operation                                               | Values       | Values | Ports and notes                        | Category     |
|---------------------------------------------------------|--------------|--------|----------------------------------------|--------------|
| Single-precision (SP) FP MUL<br>Double-precision FP MUL | 4, 1<br>5, 1 | 4, 1   | Issue port 0; Writeback port 0         | SSE based FP |
| FP MUL (X87)                                            | 5, 1         | 5, 1   | Issue port 0; Writeback port 0         | x87 FP       |
| FP Shuffle<br>DIV/SQRT                                  | 1, 1         | 1, 1   | FP shuffle does not handle QW shuffle. |              |

Floating-point peak performance on one core:

- Scalar
  - Assume 3 GHz: 6 Gflop/s scalar peak performance on one core
- Vector double precision (SSE2)
  - 1 vadd and 1 vmult / cycle (2-way): 4 flops/cycle
  - Assume 3 GHz: 12 Gflop/s peak performance on one core
- Vector single precision (SSE)
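
The peak figures above follow from one rule: flops per cycle times clock frequency. The sketch below spells that out. The function `peak_gflops` and the `LANES_*` constants are illustrative names, not part of the original material. It assumes one floating-point add and one floating-point multiply can be issued per cycle (two FP instructions per cycle), which is what the scalar and SSE2 numbers imply. The 4-lane single-precision case is an extrapolation from the 128-bit SSE register width, not a figure stated here.

```c
/* SIMD lanes per instruction (illustrative constants).
   128-bit SSE registers hold 2 doubles or 4 single-precision floats. */
enum {
    LANES_SCALAR = 1,
    LANES_SSE2_DOUBLE = 2,
    LANES_SSE_SINGLE = 4 /* assumption: extrapolated from the register width */
};

/* Peak Gflop/s = (FP instructions per cycle) * (SIMD lanes) * (clock in GHz). */
double peak_gflops(int fp_instr_per_cycle, int simd_lanes, double clock_ghz) {
    return (double)fp_instr_per_cycle * (double)simd_lanes * clock_ghz;
}
```

With two FP instructions per cycle at 3 GHz, `peak_gflops(2, LANES_SCALAR, 3.0)` gives 6 Gflop/s and `peak_gflops(2, LANES_SSE2_DOUBLE, 3.0)` gives 12 Gflop/s, matching the figures above. Under the same assumptions, `LANES_SSE_SINGLE` would give 24 Gflop/s.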





## Summary

- Architecture vs. microarchitecture
- To optimize code one needs to understand a suitable abstraction of the microarchitecture
- Operational intensity:
  - High = compute bound = runtime dominated by data operations
  - Low = memory bound = runtime dominated by data movement
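
One way to make the compute-bound versus memory-bound distinction concrete is a roofline-style estimate: attainable performance is the smaller of the peak performance and operational intensity times memory bandwidth. This bound is not stated in the material above, so the sketch below is only an illustration. The function names are hypothetical, and the memory bandwidth must be supplied by the caller because no bandwidth figure is given here.

```c
/* Roofline-style upper bound on performance in Gflop/s:
   min(peak, intensity * bandwidth).
   intensity:     operational intensity in flops/byte
   peak:          peak performance in Gflop/s
   bandwidth_gbs: memory bandwidth in GB/s (caller-supplied assumption) */
double attainable_gflops(double intensity, double peak, double bandwidth_gbs) {
    double memory_roof = intensity * bandwidth_gbs;
    return memory_roof < peak ? memory_roof : peak;
}

/* Returns 1 if the memory roof is at or above the peak, so the
   computation is limited by the peak (compute bound); 0 otherwise. */
int is_compute_bound(double intensity, double peak, double bandwidth_gbs) {
    return intensity * bandwidth_gbs >= peak;
}
```

As an example with a purely hypothetical bandwidth of 10 GB/s and a 12 Gflop/s peak, an intensity of 0.5 flops/byte is capped at 5 Gflop/s (memory bound), while an intensity of 100 flops/byte is capped at the 12 Gflop/s peak (compute bound).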