#### **How to Write Fast Numerical Code**

Spring 2019

Lecture: Architecture/Microarchitecture and Intel Core

Instructor: Markus Püschel

TA: Tyler Smith, Gagandeep Singh, Alen Stojanov

ETH Zürich: Eidgenössische Technische Hochschule Zürich (Swiss Federal Institute of Technology Zurich)

#### **Technicalities**

- Midterm: Mon, April 15<sup>th</sup>
- Research project:
  - Let us know once you have a partner
  - If you have a project idea, talk to me (break, after class, email)
  - Deadline: March 4th
- Finding a partner: fastcode-forum@lists.inf.ethz.ch

# **Today**

- Architecture/Microarchitecture: What is the difference?
- In detail: Intel Haswell and Sandybridge
- Crucial microarchitectural parameters
- Peak performance
- Operational intensity


#### **Definitions**

- Architecture (also instruction set architecture = ISA): The parts of a processor design that one needs to understand to write assembly code
- Examples: instruction set specification, registers
- Counterexamples: cache sizes and core frequency
- Example ISAs
  - x86
  - ia
  - MIPS
  - POWER
  - SPARC
  - ARM

#### Some assembly code



# ISA SIMD (Single Instruction Multiple Data) Vector Extensions

- What is it?
  - Extension of the ISA. Data types and instructions for the parallel computation on short (length 2–8) vectors of integers or floats



- Names: MMX, SSE, SSE2, ..., AVX, ...
- Why do they exist?
  - Useful: Many applications have the necessary fine-grain parallelism
     Then: speedup by a factor close to vector length
  - Doable: Chip designers have enough transistors to play with; easy to build with replication
- We will have an extra lecture on vector instructions
  - What are the problems?
  - How to use them efficiently

#### FMA = Fused Multiply-Add

- x = x + y\*z
- Done as one operation, i.e., involves only one rounding step
- Better accuracy than sequence of mult and add
- Natural pattern in many algorithms

```
/* matrix multiplication; A, B, C are n x n matrices of doubles */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i*n+j] += A[i*n+k]*B[k*n+j];
```

Exists only recently in Intel processors (Why?)



# **Definitions**

- Microarchitecture: Implementation of the architecture
- Examples: caches, cache structure, CPU frequency, details of the virtual memory system
- Examples
  - Intel processors (Wikipedia)
  - AMD processors (Wikipedia)


# **Intel's Tick-Tock Model**

- Tick: Shrink of process technology
- Tock: New microarchitecture
- **Example: Core and successors.** Shown: Intel's microarchitecture code names (server/mobile may be different)



In 2016, the tick-tock model was discontinued.

Now: process – architecture – optimization (since the tick, i.e., the process shrink, has become harder)





- Distribute microarchitecture abstraction



#### **Runtime Bounds (Cycles) on Haswell**

```
/* x, y are vectors of doubles of length n, alpha is a double */
for (i = 0; i < n; i++)
    x[i] = x[i] + alpha*y[i];
```

Number of flops? 2n

| Runtime bound | Cycles | Maximal achievable percentage of (vector) peak performance |
|---|---|---|
| No vector ops | n/2 | |
| Vector ops | n/8 | |
| Data in L1 | n/4 | 50 |
| Data in L2 | n/4 | 50 |
| Data in L3 | n/2 | 25 |
| Data in main memory | n | 12.5 |

Runtime dominated by data movement:

Memory-bound
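Each bound above is an amount of work or data divided by a peak throughput. The sketch below reproduces the numbers under stated assumptions: 4 flops/cycle for scalar FMA and 16 flops/cycle for AVX FMA, and read bandwidths of 64, 64, 32, and 16 bytes/cycle for L1, L2, L3, and main memory. The bandwidth values are back-derived from the bounds listed here, and all function names are hypothetical.

```c
/* Lower bound on cycles: amount of work (or data) divided by throughput. */
double cycles_bound(double amount, double per_cycle) {
    return amount / per_cycle;
}

/* Compute bound for x[i] = x[i] + alpha*y[i]: 2n flops.
   flops_per_cycle: assumed 4 (scalar FMA) or 16 (4-way AVX FMA). */
double daxpy_compute_bound(double n, double flops_per_cycle) {
    return cycles_bound(2.0 * n, flops_per_cycle);
}

/* Data bound: x and y are each read once, 8 bytes per double = 16n bytes.
   bytes_per_cycle: assumed read bandwidth of the level holding the data. */
double daxpy_read_bound(double n, double bytes_per_cycle) {
    return cycles_bound(16.0 * n, bytes_per_cycle);
}

/* Best achievable percentage of vector peak for a given read bandwidth. */
double percent_of_vector_peak(double n, double bytes_per_cycle) {
    double compute = daxpy_compute_bound(n, 16.0);
    double data = daxpy_read_bound(n, bytes_per_cycle);
    double runtime = compute > data ? compute : data;  /* larger bound wins */
    return 100.0 * compute / runtime;
}
```

With these assumed bandwidths, `percent_of_vector_peak(n, 64.0)` gives 50, `percent_of_vector_peak(n, 32.0)` gives 25, and `percent_of_vector_peak(n, 16.0)` gives 12.5, independent of n.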

### **Runtime Bounds (Cycles) on Haswell**

```
/* matrix multiplication; A, B, C are n x n matrices of doubles */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i*n+j] += A[i*n+k]*B[k*n+j];
```

Number flops? 2n<sup>3</sup>

Runtime bound no vector ops: n³/2

Runtime bound vector ops: n<sup>3</sup>/8

Runtime bound data in L1: 3/8 n<sup>2</sup>

...

Runtime bound data in main memory: 3/2 n<sup>2</sup>

Runtime dominated by data operations, i.e., the flops (except for very small n):

Compute-bound
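A short worked comparison, using only the bounds listed above, shows what "very small n" means here. The vector compute bound reaches or exceeds a data bound once

$$
\frac{n^3}{8} \ge \frac{3}{8}n^2 \iff n \ge 3 \quad \text{(data in L1)}, \qquad
\frac{n^3}{8} \ge \frac{3}{2}n^2 \iff n \ge 12 \quad \text{(data in main memory)}.
$$

Beyond these sizes the cubic flop count outgrows the quadratic data volume, so the compute bound determines the runtime bound.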

# **Operational Intensity**

Definition: Given a program P, assume cold (empty) cache

Operational intensity:

$$I(n) = \frac{W(n)}{Q(n)}$$

- W(n): number of flops (for input size n)
- Q(n): number of bytes transferred cache $\leftrightarrow$ memory (for input size n)


# **Operational Intensity (Cold Cache)**

```
/* x, y are vectors of doubles of length n, alpha is a double */
for (i = 0; i < n; i++)
    x[i] = x[i] + alpha*y[i];
```

- Operational intensity:
  - Flops: W(n) = 2n
  - Memory transfers (doubles): ≥ 2n (just from the reads)
  - Reads (bytes): Q(n) ≥ **16n**
  - Operational intensity: I(n) = W(n)/Q(n) ≤ 1/8

# **Operational Intensity (Cold Cache)**

```
/* matrix multiplication; A, B, C are n x n matrices of doubles */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i*n+j] += A[i*n+k]*B[k*n+j];
```

- Operational intensity:
  - Flops: W(n) = 2n<sup>3</sup>
  - Memory transfers (doubles): ≥ 3n² (just from the reads)
  - Reads (bytes): Q(n) ≥ 24n²
  - Operational intensity: I(n) = W(n)/Q(n) ≤ n/12
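The two cold-cache bounds can be checked numerically. The following is a minimal sketch with hypothetical function names; it counts only the compulsory reads, exactly as in the two examples, so the returned values are upper bounds on I(n).

```c
/* Operational intensity I = W / Q in flops per byte. */
double operational_intensity(double flops, double bytes) {
    return flops / bytes;
}

/* Upper bound for x = x + alpha*y:
   W = 2n flops, Q >= 16n bytes (x and y each read once). */
double daxpy_intensity_bound(double n) {
    return operational_intensity(2.0 * n, 16.0 * n);
}

/* Upper bound for n x n matrix multiplication:
   W = 2n^3 flops, Q >= 24n^2 bytes (A, B, C each read once). */
double mmm_intensity_bound(double n) {
    return operational_intensity(2.0 * n * n * n, 24.0 * n * n);
}
```

The vector update stays at 1/8 flops/byte for every n, whereas the matrix multiplication bound grows linearly: `mmm_intensity_bound(12.0)` is 1 and `mmm_intensity_bound(120.0)` is 10.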


# **Compute/Memory Bound**

- A function/piece of code is:
  - Compute bound if it has high operational intensity
  - Memory bound if it has low operational intensity
- A more exact definition depends on the given platform
- More details later: Roofline model
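One common way to make the platform dependence concrete is the roofline bound P ≤ min(π, I·β), where π is the peak performance (flops/cycle) and β the memory bandwidth (bytes/cycle). The sketch below is illustrative only: the function names are hypothetical, and the example values π = 16 flops/cycle and β = 16 bytes/cycle are assumptions taken from the Haswell numbers used in this lecture.

```c
/* Roofline bound: attainable performance in flops/cycle for operational
   intensity I (flops/byte), peak pi (flops/cycle), and memory
   bandwidth beta (bytes/cycle). */
double roofline_bound(double intensity, double peak, double bandwidth) {
    double memory_limit = intensity * bandwidth;
    return memory_limit < peak ? memory_limit : peak;
}

/* Returns 1 if the bandwidth term limits performance (memory bound),
   0 if the peak term does (compute bound) on this platform. */
int is_memory_bound(double intensity, double peak, double bandwidth) {
    return intensity * bandwidth < peak;
}
```

With the assumed values, an intensity of 1/8 gives `roofline_bound(0.125, 16.0, 16.0)` = 2 flops/cycle, which is 12.5% of peak and therefore memory bound. An intensity of 100 reaches the full 16 flops/cycle and is compute bound.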







#### Comments on Intel Haswell µarch

- Peak performance 16 DP flops/cycle (only reached if SIMD FMA)
  - Peak performance mults: 2 mults/cycle (scalar 2 flops/cycle, SIMD AVX 8 flops/cycle)
  - Peak performance adds: 1 add/cycle (scalar 1 flop/cycle, SIMD AVX 4 flops/cycle).
     FMA in port 0 can be used for add, but longer latency
- L1 bandwidth: two 32-byte loads and one 32-byte store per cycle (Sandy Bridge: either one 16-byte load and one 16-byte store, or one 32-byte load)
- Shared L3 cache organized as multiple cache slices for better scalability with number of cores, thus access time is non-uniform
- Shared L3 cache in a different clock domain (uncore)
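The peak numbers in the first bullet follow from simple counting, assuming two FMA-capable ports per core and 4 doubles per 256-bit AVX vector:

$$
\underbrace{2}_{\text{FMAs/cycle}} \times \underbrace{4}_{\text{doubles/vector}} \times \underbrace{2}_{\text{flops/FMA}} = 16 \ \text{DP flops/cycle}
$$

Without FMA, the same counting gives 2 × 4 = 8 flops/cycle for mults and 1 × 4 = 4 flops/cycle for adds.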

# **Example: Peak Performance**





# Peak performance of this computer:

- 4 cores × 2-way SSE × (1 add and 1 mult)/cycle
- = 16 flops/cycle
- = 48 Gflop/s (these two numbers imply a 3 GHz clock)
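The same product can be written as a small helper. This is a sketch with hypothetical function names. The 3 GHz clock is not stated explicitly above; it is inferred from 48 Gflop/s divided by 16 flops/cycle.

```c
/* Peak in flops/cycle: cores x SIMD lanes x flops per lane per cycle.
   Independent of the clock frequency. */
int peak_flops_per_cycle(int cores, int simd_lanes, int flops_per_lane_cycle) {
    return cores * simd_lanes * flops_per_lane_cycle;
}

/* Peak in Gflop/s: flops/cycle times the clock frequency in GHz. */
double peak_gflops(int cores, int simd_lanes, int flops_per_lane_cycle,
                   double clock_ghz) {
    return peak_flops_per_cycle(cores, simd_lanes, flops_per_lane_cycle)
           * clock_ghz;
}
```

For the machine above, `peak_flops_per_cycle(4, 2, 2)` is 16 and `peak_gflops(4, 2, 2, 3.0)` is 48. A single Haswell core with 4-way AVX and two FMAs per cycle (4 flops per lane per cycle) gives `peak_flops_per_cycle(1, 4, 4)` = 16.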


#### **Summary**

- Architecture vs. microarchitecture
- To optimize code one needs to understand a suitable abstraction of the microarchitecture and its key quantitative characteristics
  - Memory hierarchy with throughput and latency info
  - Execution units with port, throughput, and latency info
- Operational intensity:
  - High = compute bound = runtime dominated by data operations (flops)
  - Low = memory bound = runtime dominated by data movement