### 263-0007-00: Advanced Systems Lab

Assignment 1: 100 points Due Date: Th, March 9th, 17:00 https://acl.inf.ethz.ch/teaching/fastcode/2023/ Questions: fastcode@lists.inf.ethz.ch

#### Exercises:

- 1. (15 pts) Get to know your machine Determine and create a table for the following microarchitectural parameters of your computer:
  - (a) Processor manufacturer, name, number and microarchitecture (e.g. Haswell, Skylake, etc). Solution: Intel(R) Xeon(R) CPU E3-1275 v5 (Skylake).
  - (b) CPU base frequency.Solution: 3.6 GHz is the nominal CPU frequency.
  - (c) CPU maximum frequency. Does your CPU support Turbo Boost or a similar technology? Solution: It does support Turbo Boost, and the maximum frequency is 4.0GHz.
  - (d) Phase in the Intel's development model: Tick, Tock or Optimization. (if applicable) **Solution:** Tock phase (Skylake).

Intel's processors offer two different floating-point instruction sets, namely x87 and SSE/SSE2, that can perform scalar floating-point operations. For example, a floating-point division can be performed using either FDIV (from x87) or DIVSD (from SSE2) assembly instructions. The x87 instruction set, however, is becoming deprecated but is still supported for backward compatibility.

(e) Name three differences between x87 and SSE2.

Solution: Some possible answers:

- SSE2 supports vector operations using 128-bit registers.
- SSE2 instructions in modern processor usually have better throughput than x87 equivalent instructions.
- x87 works on 80-bit floating-point precision.
- x87 supports also trigonometric functions.
- SSE2 has a register based programming model whereas x87 is stack based.

For one core and **without** using SIMD vector instructions, determine the following about your machine. In (g)-(h), make sure to use the correct floating-point instruction (not the one from x87 in case you have an Intel processor) and provide the reference where you found the latency and throughput information.

(f) Maximum theoretical floating-point peak performance in flops/cycle.

Solution: Without SIMD instructions, two FMAs can be issued per cycle. Thus, 4 flops/cycle.

(g) Latency [cycles], throughput [ops/cycle] and instruction name for both double- and single-precision floating-point multiplication.

Solution: Latency: 4 cycles. Throughput: 2 per cycle. Instruction: MULSS(D).

(h) Latency [cycles], throughput [ops/cycle] and instruction name for both double- and single-precision square root.

# Solution: According to Intel: For single precision (SQRTSS) Latency: 13 cycles. Throughput: 0.33 per cycle. For double precision (SQRTSD) Latency: 18 cycles. Throughput: 0.166 per cycle. According to Agner Fog's measurements: For single precision (SQRTSS) Latency: 12 cycles. Throughput: 0.33 per cycle. For double precision (SQRTSD) Latency: 15-16 cycles. Throughput: 0.166-0.25 per cycle.

 (i) Latency [cycles], throughput [ops/cycle] and instruction name for double-precision ceiling operation, i.e., the operation that rounds a floating-point number up to an integer-valued floating-point.
 Solution:

Skylake: Latency: 8 cycles. Throughput: 1 per cycle. Instruction: ROUNDSD. Firestorm (M1): Latency: 3 cycles. Throughput: 4 per cycle. Instruction: FRINTP

2. (20 pts) Matrix multiplication

In this exercise, we provide a C source file for multiplying an  $n \times n$  matrix with its transpose and a C header file that allows to read the time stamp counter (TSC) of the processor for x86 compatible systems. The code uses different timers available to time the matrix multiplication. Note that if you have an Apple M1/M2 processor, you can still access some of the timers available so you can still complete the homework. Inspect and understand the code and do the following:

- (a) Using your computer, compile and run the code. Compile with the highest level of optimization provided by your compiler (with GCC, compile with the flag -03). A modern compiler will automatically vectorize this very simple routine. Ensure you get consistent timings between timers and for at least two consecutive executions. Don't forget to disable Turbo Boost. (No need to answer anything here)
- (b) Inspect the compute() function in mmm.c and answer the following:
  - i. Determine the exact number of floating-point additions and multiplications that it performs. Solution: The code performs  $2n^3 + n^2$  floating-point operations.
  - ii. Determine an upper bound on its operational intensity (consider only reads and assume empty caches).

Solution:

 $W(n) = 2n^3 + n^2$  and  $Q(n) \ge 2 \cdot 8n^2$ . Thus,  $I(n) \le \frac{n}{8}$  flops/bytes.

- (c) For all square matrices of sizes n between 100 and 1500, in increments of 100, create a performance plot with n on the x-axis and performance (flops/cycle) on the y-axis. Create three series such that:
  - i. The first series has all optimizations disabled: use flag -00.
  - ii. The second series has the major optimizations except for vectorization: use flags -O3 and -fno-tree-vectorize. If you are using the clang compiler, also add -fno-slp-vectorize flag to disable vectorization.
  - iii. The third series has all major optimizations enabled including vectorization: use flags -O3, -ffast-math and -march=native. If you are using an Apple M1 processor and your compiler doesn't support -march=native you can use -mcpu=apple-m1 instead.

Solution:



Figure 1: Plots resulting from execution of mmm.c (vector peak performance: 16 f/c for the given flags).

(d) Discuss performance variations of your plots and report the highest performance that you achieved.

# Solution:

- i. Non-optimized (v1): This results in machine code that is neither optimized or vectorized. This is nice for debugging. However, the performance is low and flat across problem sizes.
- ii. Optimized but non-vectorized (v2): The performance is better than in the previous case. However, the performance suffers due to the limited amount of ILP caused by inter loop dependencies.
- iii. Fully optimized (v3): The -ffast-math flag enables ILP which is combined with vectorization and significantly improves performance. The computation performs well for small problem sizes but performance suffers greatly as soon as the matrix A no longer fits in the cache. The highest performance that we achieve is 2.4 flops/cycle.
- 3. (25 pts) Performance analysis and bounds

Assume that vectors u, w, x, y and z of length n are implemented using double precision floating-point and combined as follows:

$$z_i = u_i \cdot u_i \cdot u_i + z_i \cdot \max(x_i - y_i, u_i - w_i)$$

We consider a Core i7 CPU with a Skylake microarchitecture. As seen in the lecture, it offers FMA instructions (as part of AVX2). Recall that we consider cost of the FMA instruction as two floating-point operations (an addition and a multiplication). Assume the bandwidths that are given in the additional material from the lecture: Abstracted Microarchitecture. Assume that no optimization is performed that simplifies floating-point arithmetic (i.e. -ffast-math flag is not used) and that the max function is translated to a maxsd instruction by the compiler. Answer the following and justify your answers.

(a) Define a suitable detailed floating-point cost measure C(n). Solution:

 $C(n) = C_{add} \cdot N_{add} + C_{mult} \cdot N_{mult} + C_{max} \cdot N_{max}.$ 

(b) Compute the cost C(n) of the computation.

Solution:

$$\begin{split} N_{add} &= 3n, \\ N_{mul} &= 3n, \\ N_{max} &= n, \\ C(n) &= C_{add} \cdot (3n) + C_{mul} \cdot (3n) + C_{max} \cdot (n). \end{split}$$

- (c) Consider only one core without using vector instructions (i.e. using flag -fno-tree-vectorize) and determine a hard lower bound (not asymptotic) on the runtime (measured in cycles), based on:
  - i. The throughput of the floating-point operations. Assume that no FMA instructions are used. Be aware that the lower bound is also affected by the available ports offered for the computation (see lecture slides).
  - ii. The throughput of the floating-point operations where FMAs are used to fuse an addition and a multiplication (i.e. -mfma flag is enabled).
  - iii. The throughput of data reads, for the following two cases: All floating-point data is L3resident, and all floating-point data is RAM-resident. Consider the best case scenario (peak bandwidth and ignore latency). Note that arrays that are only written are also read and this read should be included.

**Solution:** We can obtain bounds by examining which execution ports the instructions are scheduled to and the throughputs of those instructions.

- i. The instruction mix in this case consists of 3n floating-point additions and 3n floating-point multiplications and n floating-point max operations. All operations can be scheduled in either Port 0 or Port 1. Thus, a lower bound on the runtime is 3.5n cycles.
- ii. We can only fuse the final addition with a multiplication into an FMA. Thus, we have n FMA instructions, 2n additions, 2n multiplications and n max operations. FMAs can also be scheduled in either Port 0 or Port 1. Thus, resulting in a lower bound of 3n cycles.
- iii. Abstracted Microarchitecture shows peak bandwidth of L3, and an estimate for the RAM throughput. In the computation, at least 5n doubles have to be read in total. Thus,  $r_{L3} \geq \frac{5n}{4}$  and  $r_{RAM} \geq \frac{5n}{2}$ .
- (d) Determine an upper bound on the operational intensity. Assume empty caches and consider only reads but note: arrays that are only written are also read and this read should be included.

**Solution:** The operational intensity is  $I(N) \leq \frac{7nflops}{8(5n)bytes} = \frac{7}{40}$  flops/byte.

4. (25 pts) Basic optimization

Consider the following function:

```
1 void comp(double *x, double *y, int n) {
2     double s = 0.0;
3     for (int i = 0; i < n; i++) {
4         s = (s + x[i]*x[i]) + y[i]*y[i]*y[i];
5     }
6     x[0] = s;
7 }</pre>
```

(a) Create a benchmarking infrastructure based on the timing function that produces the most consistent results in Exercise 2 and for all two-power sizes  $n = 2^4, \ldots, 2^{23}$  create a performance plot for the function comp with n on the x-axis (choose logarithmic scale) and performance (in flops/cycle) on the y-axis. Randomly initialize all arrays. For all n repeat your measurements 30 times reporting the median in your plot. Compile your code with flags -O3 -mfma<sup>1</sup> -fno-tree-vectorize. If you are using clang, add also the -fno-slp-vectorize and -ffp-contract=fast flags.

 $<sup>^{1}</sup>$ For Apple M1/M2 processors, the flag -mfma may not be supported. If this is the case, use instead -mcpu=apple-m1 or -march=native.

(b) Considering the latency and throughput information of floating-point operations in your machine, and the dependencies in comp, derive an upper bound on the performance (flops/cycles) of comp when using the specified flags in (a), i.e., when FMA instructions are enabled (-mfma) but vectorization is disabled (-fno-tree-vectorize).

### Solution:

The runtime is limited by an inter loop dependency when accumulating the values in **s**. Further, the two additions will be combined with two multiplications into FMAs. The latency of FMA is 4 cycles (Skylake) and there are two in serie in every loop iterations. Thus,  $T(n) \ge 8n$ . Since W(n) = 5n, the performance is upper bounded by  $\pi(n) \le 0.625$  flops/cycle.

- (c) Perform optimizations that increase the ILP of function comp to improve its runtime. It is not allowed to use vector instructions. Add the performance to the previous plot (so one plot with two series in total for (a) and (c)). Compile your code with the same flags as before.
- (d) Discuss performance variations of your plot and report the highest performance that you achieved. Also discuss the optimizations that you performed to increase the ILP.
- (e) Enroll and submit the code of your optimized function in Code Expert. Carefully read and follow the instructions given in Code Expert to submit your code.

### Solution:



Figure 2: Performance plot (peak performance: 4 f/c for the given flags).

In the original code, the performance suffers from inter loop dependency which limits the amount of ILP. Thus, the performance is 0.62 flops/cycle across all problem sizes and it's consistent with the upper bound derived in (b). Unrolling the loop and using separate accumulators increases the ILP. For the given machine, we need at least 6 accumulators. We see that performance varies across problem sizes. Performance is great when the data fits in cache, and becomes worse as the size of the data grows. We can even see "steps": performance is greatest when the data fits in L1, and becomes incrementally worse as it no longer fits in subsequent levels of cache. The maximum performance achieved is 3.2 flops/cycle.

### 5. (10 pts) ILP analysis

Consider the following computations:

```
1
    #include <math.h>
 \mathbf{2}
 3
    double artcomp1(double a, double b, double c, double d) {
 4
         double r:
 \mathbf{5}
         r = (a*a + b*b) - (c*c + d*d);
 \mathbf{6}
          eturn r;
    }
 7
 8
 9
    double artcomp2(double a, double b, double c, double d) {
10
         double r;
         r = ceil(a)*b - ceil(c)*d;
11
12
         return r:
13
    }
```

Make the same assumptions as in Exercise 3, i.e., consider a Skylake processor, only one core without using vector instructions (using flag -fno-tree-vectorize), and assume that no optimization is performed that simplifies floating-point arithmetic (i.e. -ffast-math flag is not used). Thus, it is not allowed to apply associativity and distributivity laws to rearrange the computation. Determine hard lower bounds (not asymptotic) on the runtime (measured in cycles) for the following cases, based on the latency, throughput and dependencies of the floating-point operations only. Assume that no FMA instruction is generated (i.e. -mfma flag is not used). Be aware that the lower bound is also affected by the available ports offered for the computation (see lecture slides). It may be useful to draw the dependency graph of the computation. Justify your answers.

(a) Determine a hard lower bound on the runtime for artcomp1.

Solution: The runtime is at least 13 cycles as shown in the critical path of the dependency graph in Figure 3 (left). Note that the four multiplications cannot start at the same time due to limited throughput of 2 mults/cycle in Skylake. Thus, the start of two of the multiplications will be delayed by 1 cycle.

(b) Determine a hard lower bound on the runtime for artcomp2. Assume that the compiler transforms the ceil function to the respective assembly instruction performing this operation (so no function call to the math library occurs).

**Solution:** The runtime is at least **16 cycles** as shown in the critical path of the dependency graph in Figure 3 (right). Note that according to Agners' tables the overall throughput of the ceil operation (roundsd) is 1 but it consists of two micro-ops that can use ports 0 and 1. Thus, it is possible that two ceil operations can be scheduled in parallel giving a lower bound of 16 for the computation. Since there is no detail on how these micro-ops will be scheduled we also accept 17 cycles as correct which corresponds to the case where the two ceil operations can not be scheduled in parallel.



Figure 3: Dependency graph for artcomp.