# OoO Instruction Benchmarking Framework on the Back of Dragons



Artifact Available git.io/fANPW

Julian Hammer <julian.hammer@fau.de>, Georg Hager (advisor) <georg.hager@fau.de>, Gerhard Wellein (advisor) <gerhard.wellein@fau.de>

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Regionales Rechenzentrum Erlangen (RRZE)

## Goal

## Overview



# Related Work







Columns 2-6: Intel Skylake / i7-6700HQ. Columns 7-10: AMD Zen / EPYC 7451.

|   | Skylake latency [Intel] | Skylake latency [AF] | Skylake throughput | Skylake throughput [Intel] | Skylake throughput [AF] | Zen latency | Zen latency [AF] | Zen throughput | Zen throughput [AF] |
|---|---------|-----------|----------------|---------------------|-------|---------|-------|------------|------|
| ) | 1       | 1         | 0.30           | 0.25                | 0.25  | 1.00    | 1.00  | 0.28       | 0.25 |
| ) | 1       | 1         | 0.30           | 0.25                | 0.25  | 1.00    | 1.00  | 0.29       | 0.25 |
| ) | 1       | 1         | 0.30           | 0.25                | 0.25  | 1.00    | n/a   | 0.29       | n/a  |
| ) | 1       | 1         | 0.30           | 0.25                | 0.25  | 1.00    | 1.00  | 0.29       | 0.25 |
|   | n/a     | 0-1       | 0.27           | n/a                 | 0.25  | 0.38    | n/a   | 0.26       | n/a  |
| ) | 3       | 3         | 1.00           | 1                   | 1     | 1.00    | 1.00  | 0.74       | 0.5  |
|   | 6       | n/a       | 2.00           | 1                   | n/a   | 1.00    | n/a   | 1.00       | n/a  |
|   | 4       | 4         | 0.52           | 0.5                 | 0.5-1 | 3.00    | 3.00  | 1.00       | 1    |
|   | n/a     | 4         | 0.52           | n/a                 | 0.5   | 3.00    | 3.00  | 0.50       | 0.5  |
|   | n/a     | 4         | 0.52           | n/a                 | 0.5   | 3.00    | 3.00  | 0.50       | 0.5  |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 1.00       | 1    |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 0.61       | 1    |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 1.00       | 1    |
| ) | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 0.61       | 1    |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 0.61       | 1    |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 5.00    | 5.00  | 0.61       | 1    |
|   | 4       | 4         | 0.52           | 0.5                 | 0.5-1 | 4.00    | 4.00  | 1.00       | 1    |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 4.00    | 4.00  | 0.55       | 0.5  |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 3.00    | 3.00  | 0.51       | 0.5  |
|   | n/a     | 4         | 0.52           | n/a                 | 0.5   | 3.00    | 3.00  | 0.50       | 0.5  |
|   | n/a     | n/a       | 0.52           | n/a                 | 0.5-1 | 3.00    | 3.00  | 0.50       | 0.5  |
|   | 14      | 13-14     | 8.00           | 8                   | 8     | 8.00    | 8-13  | 8.00       | 8-9  |
|   | n/a     | 13-15     | 4.00           | n/a                 | 4     | 8.00    | 8-13  | 4.00       | 4-5  |
|   | n/a     | n/a       | 3.00           | n/a                 | 3-5   | 10.01   | 10.00 | 3.02       | 3    |

### Resource Conflicts

To quantify how two distinct instructions A and B overlap, we use the following metric, where $\mathrm{TP}^{-1}(A)$ is the reciprocal throughput of A and $\mathrm{TP}^{-1}(A+B)$ that of A and B benchmarked together:

$$\frac{\mathrm{TP}^{-1}(A+B) - \max(\mathrm{TP}^{-1}(A), \mathrm{TP}^{-1}(B))}{\min(\mathrm{TP}^{-1}(A), \mathrm{TP}^{-1}(B))} = \begin{cases} \gg 1 & \text{additional overhead} \\ \approx 1 & \text{no overlap} \\ \approx 0 & \text{complete overlap} \\ \ll 0 & \text{elimination} \end{cases}$$



Four instruction groups emerge: integer, convert and insert, floating point (FP), and FP division. Instructions from different groups overlap with each other, but instructions within the same group do not. This grouping is the basis for a concurrency model.
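The overlap metric can be evaluated with a few lines of Python. This is a minimal sketch, not code from asmbench: the function names `overlap_metric` and `classify`, the tolerance `tol`, and the "partial overlap" label for values between the listed cases are assumptions made for illustration, and the inputs in the usage lines are made-up reciprocal throughputs rather than measurements.

```python
def overlap_metric(tp_a: float, tp_b: float, tp_ab: float) -> float:
    """Overlap metric for two instructions A and B.

    tp_a, tp_b: reciprocal throughputs of A and B alone.
    tp_ab:      reciprocal throughput of A and B benchmarked together.
    """
    return (tp_ab - max(tp_a, tp_b)) / min(tp_a, tp_b)


def classify(metric: float, tol: float = 0.25) -> str:
    """Map a metric value to one of the cases; tol is an assumed tolerance."""
    if metric < -tol:
        return "elimination"
    if abs(metric) <= tol:
        return "complete overlap"
    if abs(metric - 1.0) <= tol:
        return "no overlap"
    if metric > 1.0 + tol:
        return "additional overhead"
    # Values between the listed cases; this label is not part of the metric's definition.
    return "partial overlap"


# Hypothetical inputs: the combined run is as fast as the slower instruction alone.
print(classify(overlap_metric(0.25, 0.5, 0.5)))
# Hypothetical inputs: the combined run takes the sum of both.
print(classify(overlap_metric(1.0, 1.0, 2.0)))
```

With these inputs the first call yields a metric of 0 (the combined run costs no more than the slower instruction) and the second a metric of 1 (the two instructions serialize).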

### Load and Store in L1

Load latency and throughput are studied with pointer chasing. Both AMD Zen and Intel Skylake show a latency of 2 cycles and a reciprocal throughput of 0.5 cycles. Store benchmarks are work in progress: stores are mostly "fire-and-forget", but they occupy shared resources in the address generation units.
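The idea behind pointer chasing is that every load depends on the result of the previous one, so the loads cannot overlap and the time per step reflects load latency. The sketch below only illustrates that dependency pattern in Python; it is not how asmbench generates its benchmarks, it does not measure cycle-accurate latency, and the names `make_pointer_chain` and `chase` are hypothetical. It uses Sattolo's algorithm so that the chain forms a single cycle through all elements.

```python
import random


def make_pointer_chain(n: int, seed: int = 0) -> list:
    """Build a random permutation that is one cycle of length n (Sattolo's algorithm)."""
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself, which guarantees a single cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain: list, steps: int, start: int = 0) -> int:
    """Follow the chain; each lookup needs the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]  # dependent load: the next address is the loaded value
    return idx


chain = make_pointer_chain(64)
# After exactly len(chain) steps the single cycle returns to the start.
print(chase(chain, len(chain)))
```

In a real latency benchmark the same dependent chain would be expressed in assembly and kept small enough to stay in L1.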

### ISA Extraction

We parse LLVM's TableGen database to extract available instructions of an instruction set architecture (ISA). This works well for x86, but will require some adaptations for other architectures.

### Future Work

- More flexible instruction serialization
- Combined load and instruction benchmarking
- Store benchmarks
- Parallel benchmarking for higher throughput
- Support for other instruction set architectures (e.g., ARM, Power8)

# Try It Yourself

Disable frequency scaling and turbo mode, then install and run:

```
$ pip3 install --user asmbench[sc18src]==0.1.2
$ python3 -m asmbench.sc18src
```

Released under AGPLv3: github.com/RRZE-HPC/asmbench