BLM6112<br>Advanced Computer Architecture<br>Fundamentals of Quantitative Design and Analysis<br>Prof. Dr. Nizamettin AYDIN<br>naydin@yildiz.edu.tr<br>http://www3.yildiz.edu.tr/~naydin

## Outline

- Latency, delay, time
- Throughput
- Cost
- Power
- Energy
- Reliability


## What is Computer Architecture?



- Coordination of many levels of abstraction
- Under a rapidly changing set of forces
- Design, Measurement, and Evaluation


## Performance Metrics

## - Objectives

- How can we meaningfully measure and compare computer performance?
- Understand why program performance varies
- Understand how applications and the compiler impact performance
- Understand how CPU impacts performance
- What trade-offs are involved in designing a CPU?
- Purchasing perspective vs design perspective


Computer Architecture is

The attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation.

Amdahl, Blaaw, and Brooks, 1964


## A Changing Definition

- 1950s to 1960 s: Computer Architecture Course
- Computer Arithmetic
- 1970s to mid 1980s: Computer Architecture Course
- Instruction Set Design, especially ISA appropriate for compilers
- 1990s: Computer Architecture Course
- Design of CPU, memory system, I/O system, Multiprocessors


## Instruction Set Architecture

- Organization of Programmable Storage
- Data Types \& Data Structures Encodings \& Representations
- Instruction Formats
- Instruction (or Operation Code) Set
- Modes of Addressing and Accessing Data Items and Instructions
- Exceptional Conditions



## Computer Technology

- Performance improvements:
- Improvements in semiconductor technology
- Feature size, clock speed
- Improvements in computer architectures
- Enabled by HLL compilers, UNIX
- Lead to RISC architectures
- Together have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming languages


## The Instruction Set: a Critical Interface



## Computer Organization

- Capabilities \& Performance Characteristics of Principal Functional Units
- (e.g., Registers, ALU, Shifters, Logic Units, ...)
- Ways in which these components are interconnected
- Information flows between components
- Logic and means by which such information flow is controlled.
- Choreography of FUs to realize the ISA
- Register Transfer Level (RTL) hardware design




## Current Trends in Architecture

- Cannot continue to leverage Instruction-Level Parallelism (ILP)
- Single processor performance improvement ended in 2003
- New models for performance:
- Data-Level Parallelism (DLP)
- Thread-Level Parallelism (TLP)
- Request-Level Parallelism (RLP)
- These require explicit restructuring of the application

A summary of the five mainstream computing classes and

| Feature | Personal mobile device (PMD) | Desktop | Server | Clusters/warehousescale computer | Internet of things/ embedded |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Price of system | \$100-\$1000 | \$300-\$2500 | \$5000-\$10,000,000 | \$100,000-\$200,000,000 | \$10-\$100,000 |
| Price of microprocessor | \$10-\$100 | \$50-\$500 | \$200-\$2000 | \$50-\$250 | \$0.01-5100 |
| Critical system design issucs | Cost, energy. media performance. responsivences | Price performance. energy, graphics performance | Throughput, availability. scalability, energy | Price-performance. throughput, energy proportionality | Price, energy, applicationspecific performance |

- Sales in 2015 included about
- 1.6 billion PMDs ( $90 \%$ cell phones),
- 275 million desktop PCs,
- 15 million servers.
- 19 billion embedded processors.
- In total, 14.8 billion ARM-technology-based chips were shipped in 2015


## their system characteristics

## Classes of Computers

- Personal Mobile Device (PMD)
- e.g. start phones, tablet computers Emphasis on energy efficiency and real-time
- Desktop Computing Emphasis on price-performance
- Servers

Emphasis on availability, scalability, throughput

- Clusters / Warehouse Scale Computer
- Used for "Software as a Service (SaaS)"
- Emphasis on availability and price-performance
- Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
- Internet of Things/Embedded Computers

Emphasis: price
$\qquad$

## Parallelism

- Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:
- Instruction-Level Parallelism (ILP)
- exploits data-level parallelism at modest levels with compiler help using ideas like pipelining and at medium levels using ideas like speculative execution.
- Vector architectures/Graphic Processor Units (GPUs) and multimedia instruction sets
- exploit data-level parallelism by applying a single instruction to a collection of data in parallel.


## Parallelism

- Parallelism at multiple levels is now the driving force of computer design across all four classes of computers,
- with energy and cost being the primary constraints.
- Classes of parallelism in applications:
- Data-Level Parallelism (DLP)
- arises because there are many data items that can be operated on at the same time.
- Task-Level Parallelism (TLP)
- arises because tasks of work are created that can operate independently and largely in parallel.


## Parallelism

- Thread-Level Parallelism (TLP)
- exploits either data-level parallelism or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads.
- Request-Level Parallelism (RLP)
- exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.
- When Flynn (1966) studied the parallel computing efforts in the 1960s, he found a simple classification whose abbreviations we still use today.
- He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor and placed all computers in one of four categories:


## Flynn's Taxonomy

- Single instruction stream, single data stream (SISD)
- the uniprocessor
- it can exploit ILP
- Single instruction stream, multiple data streams (SIMD)
- The same instruction is executed by multiple processors using different data streams
- exploit DLP by applying the same operations to multiple items of data in parallel
- Vector architectures
- Multimedia extensions
- Graphics processor units


## Flynn's Taxonomy

- Multiple instruction streams, single data stream (MISD)
- No commercial implementation
- Multiple instruction streams, multiple data streams (MIMD)
- Each processor fetches its own instructions and operates on its own data, and it targets TLP
- can also exploit DLP
- more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than SIMD
- Tightly-coupled MIMD
- Loosely-coupled MIMD


## Instruction Set Architecture

- Class of ISA

General-purpose registers
Register-memory vs load-store


## Instruction Set Architecture

- Memory addressing
- RISC-V: byte addressed, aligned accesses faster
- Addressing modes
- RISC-V: Register, immediate, displacement (base+offset)
- Other examples: autoincrement, indexed, PC-relative
- Types and size of operands
- RISC-V: 8-bit, 32-bit, 64-bit


## Instruction Set Architecture

- Operations
- RISC-V: data transfer, arithmetic, logical, control, floating point
- See Fig. 1.5 in text, or The RISC-V Instruction Set Manual
- Control flow instructions
- Use content of registers (RISC-V) vs. status bits (x86, ARMv7, ARMv8)
- Return address in register (RISC-V, ARMv7, ARMv8) vs. on stack (x86)
- Encoding
- Fixed (RISC-V, ARMv7/v8 except compact instruction set) vs. variable length (x86)


## Trends in Technology

- Integrated circuit technology (Moore's Law)

Transistor density: $35 \% /$ year
Die size: $10-20 \% /$ year
Integration overall: $40-55 \% /$ year

- DRAM capacity: $25-40 \% /$ year (slowing)
$8 \mathrm{~Gb}(2014), 16 \mathrm{~Gb}$ (2019), possibly no 32 Gb
- Flash capacity: 50-60\%/year

8-10X cheaper/bit than DRAM

- Magnetic disk capacity: recently slowed to 5\%/year

Density increases may no longer be possible,

- may be increase from 7 to 9 platter

8-10X cheaper/bit then Flash
200-300X cheaper/bit than DRAM

## Bandwidth and Latency

- Bandwidth or throughput
- Total work done in a given time
- 32,000-40,000X improvement for processors
- 300-1200X improvement for memory and disks
- Latency or response time
- Time between start and completion of an event
- 50-90X improvement for processors
- 6-8X improvement for memory and disks

Bandwidth and Latency


## Transistors and Wires

- Feature size
- Minimum size of transistor or wire in $x$ or $y$ dimension
- $10 \mu \mathrm{~m}$ in 1971
- $0.011 \mu \mathrm{~m}$ in 2017
- $0.003 \mu \mathrm{~m}$ in 2021
- Transistor performance scales linearly
- Wire delay does not improve with feature size!
- Integration density scales quadratically


## Power and Energy

- Problem:
- Get power in, get power out
- Thermal Design Power (TDP)
- Characterizes sustained power consumption
- Used as target for power supply and cooling system
- Lower than peak power (1.5X higher), higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement


## Dynamic Energy and Power

- Dynamic energy

$$
=\frac{\text { Capacitive Load } \times \text { Voltage }^{2}}{2}
$$

- Transistor switch from $0 \boldsymbol{\rightarrow}$ or $1 \boldsymbol{\rightarrow} 0$
- Dynamic power
$=\frac{\text { Capacitive Load } \times \text { Voltage }^{2} \times \text { Frequency switched }^{2}}{2}$
- Reducing clock rate reduces power, not energy

Power

- Intel 80386 consumed $\sim 2$ W
- 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from x 1.5 cm chip
- This is the limit of what can be cooled by air



## Static Power

- Static power consumption
$=$ Current $_{\text {static }} \times$ Voltage
$-25-50 \%$ of total power
- Scales with number of transistor
- To reduce:
- power gating



## Integrated Circuit Cost

- Integrated circuit

Cost of integrated circuit $=$ Cost of die + Cost of testing die + Cost of packaging and final test Final test yield

Cost of die $=\frac{\text { Cost of wafer }}{\text { Dies per wafer } x \text { Die yield }}$

## Reducing Power

- Techniques for reducing power:
- Do nothing well
- Dynamic Voltage-Frequency Scaling

- Low power state for DRAM, disks
- Overclocking, turning off cores


## Trends in Cost

- Cost driven down by learning curve
- Yield
- DRAM
- price closely tracks cost
- Microprocessors:
- price depends on volume
- $10 \%$ less for each doubling of volume


## Dependability

- Module reliability
- Mean time to failure (MTTF)
- Mean time to repair (MTTR)
- Mean time between failures $(\mathrm{MTBF})=\mathrm{MTTF}+$ MTTR
- Availability = MTTF $/$ MTBF

Dies per wafer $=\frac{\pi \times(\text { Wafer diameter } / 2)^{2}}{\text { Die area }}-\frac{\pi \times \text { Wafer diameter }}{\sqrt{2 \times \text { Die area }}}$

- Bose-Einstein formula:

Die yield $=$ Wafer yield $\times 1 /(1+\text { Defects per unit area } \times \text { Die area })^{N}$

- Defects per unit area $=0.016-0.057$ defects per square cm (2010)
- $\mathrm{N}=$ process-complexity factor $=11.5-15.5(40 \mathrm{~nm}, 2010)$


## Measuring Performance

- Typical performance metrics:

Response time
Throughput

- Speedup of X relative to Y

Execution time ${ }_{Y} /$ Execution time $_{X}$

- Execution time

Wall clock time: includes all system overheads
CPU time: only computation time

- Benchmarks

Kernels (e.g. matrix multiply)
Toy programs (e.g. sorting)
Synthetic benchmarks (e.g. Dhrystone)
Benchmark suites (e.g. SPEC06fp, TPC-C)

## Basic Performance Metrics

- Latency, delay, time
- Lower is better
- Complete a task as soon as possible
- Measured in sec, $\mu \mathrm{s}$, ns
- Throughput (bandwith)
- Higher is better
- Complete as many tasks per time as possible
- Measured in bytes/sec, instructions/sec
- Cost
- Lower is better
- Complete tasks for as little money as possible
- Measured in \$, TL, etc.


## Basic Performance Metrics

- Power
- Lower is better
- Complete tasks while dissipating as few joules/sec as possible
- Energy
- Lower is better
- Complete tasks using as few joules as possible
- Measured in Joules, Joules/instruction
- Reliability
- Higher is better
- Complete tasks with low probability of failure
- Measured in Mean time to failure (MTTF)
- MTTF: the average time until a failure occurs


## Bandwidth and Latency

- Bandwidth or throughput
- Total work done in a given time
- 32,000-40,000X improvement for processors
- 300-1200X improvement for memory and disks
- Latency or response time
- Time between start and completion of an event
- 50-90X improvement for processors
- $6-8 \mathrm{X}$ improvement for memory and disks


## Response Time vs Throughput

- Response time (latency)
- the time between the start and the completion of a task - Important to individual users ( passengers)
- Throughput (bandwidth)
- the total amount of work done in a given time - Important to data center managers (airline)
- Different performance metrics are required
- to benchmark embedded and desktop computers,
- which are more focused on response time,
- to benchmark servers,
- which are more focused on throughput


## Principles of Computer Design

- Take Advantage of Parallelism
- e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
- Principle of Locality
- Reuse of data and instructions
- Focus on the Common Case
- Amdahl's Law



## Principles of Computer Design

- Different instruction types having different CPIs

$$
\begin{gathered}
\text { CPU clock cycles }=\sum_{i=1}^{n} I C_{i} \times C P I_{i} \\
\text { CPU time }=\left(\sum_{i=1}^{n} I C_{i} \times C P I_{i}\right) \times \text { Clock cycle time }
\end{gathered}
$$

## Principles of Computer Design

- The Processor Performance Equation CPU time $=$ CPU clock cycles for a program $\times$ Clock cycle time

$$
\text { CPU time }=\frac{\text { CPU clock cycles for a program }}{\text { Clock rate }}
$$

$$
C P I=\frac{C P U \text { clock cycles for a program }}{\text { Instruction count }}
$$

CPU time $=$ Instruction count $\times$ Cycles per instruction $\times$ Clock cycle time
$\frac{\text { Instructions }}{\text { Program }} \times \frac{\text { Clock cycles }}{\text { Instruction }} \times \frac{\text { Seconds }}{\text { Clock cycle }}=\frac{\text { Seconds }}{\text { Program }}=$ CPU time

## What is Performance?

- Purchasing perspective
- given a collection of machines, which has the
- best performance?
- least cost ?
- best performance / cost ?
- Design perspective
- faced with design options, which has the
- best performance improvement ?
- least cost ?
- best performance / cost ?
- Both require
- basis for comparison
- metric for evaluation
- Our goal is to understand cost \& performance implications of architectural choices


## Measurement and Evaluation



Levels of Representation


## Metrics of Performance



Each metric has a place and a purpose, and each can be misused

## Basis of Evaluation



## Which has Higher Performance?

| Plane | DC to Paris | Speed | Passengers | Throughput <br> $(\mathbf{p m p h})$ |
| :--- | :--- | :--- | :--- | :--- |
| Boeing 747 | 6.5 hours | 610 mph | 470 | 286,700 |
| Concorde | 3 hours | 1350 mph | 132 | 178,200 |

- Time to run the task (Execution Time)
- Execution time, response time, latency
- Tasks per day, hour, week, sec, ns ... (Performance)
- Throughput, bandwidth


## Performance: Example

- Time of Concorde vs. Boeing 747?
- Concorde is $1350 \mathrm{mph} / 610 \mathrm{mph}=2.2$ times faster

$$
=6.5 \text { hours } / 3 \text { hours }
$$

- Throughput of Concorde vs. Boeing 747 ?
- Concorde is $178,200 \mathrm{pmph} / 286,700 \mathrm{pmph}=0.62$ "times faster"
- Boeing is $286,700 \mathrm{pmph} / 178,200 \mathrm{pmph}=1.6$ "times faster"
- Boeing is 1.6 times (" $60 \%$ ") faster in terms of throughput
- Concorde is 2.2 times (" $120 \%$ ") faster in terms of flying time
- We will focus primarily on execution time for a single job


## Amdahl's Law

- The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's Law.
- Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Amdahl's Law defines the speedup that can be gained by using a particular feature.


## Amdahl's Law

Speedup due to enhancement E:


Suppose that enhancement E accelerates a fraction F of the task by a factor S , and the remainder of the task is unaffected
$\operatorname{ExTime}(\mathbf{w} / \mathbf{E})=(\mathbf{1}-\mathbf{F}) \times \operatorname{ExTime}(\mathbf{w} / \mathbf{o} \mathbf{E})+\mathbf{F} / \mathbf{S} \times \operatorname{ExTime}(\mathbf{w} / \mathbf{E})$

## Amdahl's Law

- What is speedup?
- Suppose that we can make an enhancement to a computer that will improve performance when it is used.
- Speedup is the ratio

Speedup $=\frac{\text { Performance for entire task using the enhancement when possible }}{\text { Performace for }}$

- Alternatively

Speedup $=\frac{\text { Execution time for entire task without using the enhancement }}{\text { Execution time for entire task using the enhancement when possible }}$

## Amdahl's Law



## Amdahl's Law: Example 1

Suppose that Floating point instructions are improved to run 2 X ; but only $10 \%$ of actual instructions are FP. What's the overall speedup gained?

$$
F=0.1 \quad S=2
$$

ExTime $_{\text {new }}=$
Speedup $_{\text {overall }}=$

## Amdahl's Law: Example 1

Suppose that Floating point instructions are improved to run 2 X ; but only $10 \%$ of actual instructions are FP. What's the overall speedup gained?

$$
\begin{gathered}
\text { F=0.1 } \quad \mathrm{S}=2 \\
\text { ExTime }_{\text {new }}=\text { ExTime }_{\text {old }} \times(0.9+0.1 / 2)=0.95 \times \text { ExTime }_{\text {old }} \\
\text { Speedup }_{\text {overall }}=
\end{gathered} \frac{1}{0.95}=1.053
$$

## Amdahl's Law: Example 2

Suppose that we are considering an enhancement to the processor of a server system used for web serving. The new CPU is 10 times faster on computation in web serving application than the original processor. Assuming that the original CPU is busy with computation $40 \%$ of the time and is waiting for I/0 $60 \%$ of time. What's the overall speedup gained by incorporating the enhancement?

$$
\begin{array}{ll}
\mathrm{F}=0.4 & \mathrm{~S}=10 \\
& \\
\text { Speedup(E) } & =1 /((1-0.4)+0.4 / 10) \\
& =1.56
\end{array}
$$

## Amdahl's law: Example 2

Suppose we want to design a processor for graphic applications. Suppose that FP square root (FPSQR) computation is responsible for $20 \%$ of execution time and all FP instructions are responsible for $50 \%$ of execution time.

```
- Alternative 1: Enhance FPSQR hardware by a factor of 10.
```

Speedup $($ FPSQR $)=\mathbf{1} /((\mathbf{1 - 0 . 2})+\mathbf{0 . 2} / \mathbf{1 0})=\mathbf{1 . 2 2}$

- Alternative 2: Enhance the speedup of all FP operations by a factor of 1.6 .
Speedup $(F P)=\mathbf{1} /((\mathbf{1 - 0 . 5})+\mathbf{0 . 5} / \mathbf{1 . 6})=\mathbf{1 . 2 8}$
Improving the performance of the FP operations overall is slightly better because of the higher frequency!


## CPU Equation


CPU time $=\frac{\text { \# Seconds }}{\text { Program }}=\frac{\# \text { Instructions }}{\text { Program }} \times \frac{\text { \# Cycles }}{\text { Instruction }} \times \frac{\text { \# Seconds }}{\text { Cycle }}$

|  | Inst. Count | CPI | Clock Rate |
| :--- | :---: | :---: | :---: |
| Program | X |  |  |
| Compiler | X | $(\mathrm{X})$ |  |
| Inst. Set. | X | X |  |
| Organization |  | X | X |
| Technology |  |  | X |

## CPU Equation: Example 1

Reg-Reg Architecture

| Op | Freq | Cycles | CPI(i) | \% Time |
| :--- | ---: | :---: | :---: | :---: |
| ALU | $\mathbf{5 0 \%}$ | $\mathbf{1}$ | $\mathbf{0 . 5}$ | $\mathbf{2 3 \%}$ |
| Load | $\mathbf{2 0 \%}$ | $\mathbf{5}$ | $\mathbf{1 . 0}$ | $\mathbf{4 5 \%}$ |
| Store | $\mathbf{1 0 \%}$ | $\mathbf{3}$ | $\mathbf{0 . 3}$ | $\mathbf{1 4 \%}$ |
| Branch | $\mathbf{2 0 \%}$ | $\mathbf{2}$ | $\mathbf{0 . 4}$ | $\mathbf{1 8 \%}$ |
|  | Typical Mix |  |  | $\mathbf{2 . 2}$ |
|  |  |  |  |  |

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to save a cycle off the branch time?
- What if two ALU instructions could be executed at once?


## CPI: Cycles Per Instruction

- Average Cycles per Instruction

$$
\begin{aligned}
& \mathrm{CPI}=(\text { CPU Time } * \text { Clock Rate }) / \text { Instruction Count } \\
&=\text { Cycles } / \text { Instruction Count } \\
& \text { CPU time }=\text { Cycle Time } \times \sum_{\mathrm{j}=1}^{n} \mathbf{C P I}_{\mathrm{j}} \times \mathbf{I}_{\mathrm{j}}
\end{aligned}
$$

- Instruction Frequency $F_{j}$



## CPU Equation: Example 2

- Suppose we have the following measurements:

$$
\begin{array}{ll}
\text { - Frequency of FP operations (others than FPSQR) } & =25 \% \\
\text { - Average CPI of FP operations } & =4 \\
\text { - Average CPI of others instructions } & =1.33 \\
\text { - Frequency of FPSQR } & =2 \% \\
- \text { CPI of FPSQR } & =20
\end{array}
$$

- Assume:
- Alternative 1: decrease the CPI of FPSQR to 2
- Alternative 2: decrease the average CPI of all operations to 2.5

Compare the two design alternatives using CPU performance equation.

## CPU Equation: Example 2 (cont'd)

- CPI_original $=75 \% * 1.33+25 \% * 4=2$
- CPI_newFPSQR = CPI_original $-2 \% *\left(C P I \_o l d F P S Q R ~-~\right.$
CPI_newFPSQR)

$$
=2-0.02 *(20-2)=1.64
$$

- CPI_newFP $=75 \% * 1.33+25 \% * 2.5=1.625$
- Speedup_newFP = CPU time original/CPU time new FP
= CPU_original / CPI_newFP

$$
=2.00 / 1.625=1.23
$$

- Speedup_newFPSQR $=2.00 / 1.64=1.22$


## SPEC: System Performance Evaluation Cooperative

- First Round 1989

10 programs yielding a single number ("SPECmarks")

- Second Round 1992

SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)

- Compiler Flags unlimited. March 93 of DEC 4000 Model 610:
spice: unix.c:/def=(sysv, has_bcopy," ${ }^{\text {bcopy }}(\mathrm{a}, \mathrm{b}, \mathrm{c})=$
memcpy $(b, a, c)$ "
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/1c=blas
- Third Round 1995
new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
"benchmarks useful for 3 years"
Single flag setting for all programs: SPECint_base95, SPECfp_base95

CPU Equation: Example 3

- Assume CPI = 1.0 ignoring branches (ideal)
- Assume branch solution was delaying for 3 cycles
- If $30 \%$ branch, delay 3 cycles on $30 \%$

| Op | Freq | Cycles | CPI(i) (\% Time) |
| :--- | :---: | :---: | :---: | :---: |
| Other | $70 \%$ | 1 | $0.7(37 \%)$ |
| Branch | $30 \%$ | 4 | $1.2(63 \%)$ |
|  |  |  |  |
| W new | CPI $=1.9$ |  |  |

New machine is $1 / 1.9=0.52$ times faster (i.e. slow!)

## SPEC First Round

- One program: $99 \%$ of time in single line of code
- New front-end compiler could improve dramatically



## More Performance Metrics

- Arithmetic mean (weighted arithmetic mean) tracks execution time: $\sum\left(\mathrm{T}_{\mathrm{i}}\right) / \mathrm{n}$ or $\sum\left(\mathrm{W}_{\mathrm{i}} * \mathrm{~T}_{\mathrm{i}}\right)$
- Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $\mathrm{n} / \sum\left(1 / \mathrm{R}_{\mathrm{i}}\right)$ or $\mathrm{n} / \sum\left(\mathrm{W}_{\mathrm{i}} / \mathrm{R}_{\mathrm{i}}\right)$
- Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
- But do not take the arithmetic mean of normalized execution time, use the geometric mean $\left(\Pi\left(\mathrm{R}_{\mathrm{i}}\right)^{\wedge} 1 / \mathrm{n}\right)$


## Impact of Means on SPECmark89 for IBM 550

| Program | Ratio to VAX |  |  | Time | Weig | ed Time |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Befor | After | Before | After | Before | After |
| gcc 30 | 29 | 49 | 51 | 8.91 | 9.22 |  |
| espresso35 | 34 | 65 | 67 | 7.64 | 7.86 |  |
| spice | 47 | 47 | 510 | 510 | 5.69 | 5.69 |
| doduc | 46 | 49 | 41 | 38 | 5.81 | 5.45 |
| nasa7 | 78 | 144 | 258 | 140 | 3.43 | 1.86 |
| 1 l | 34 | 34 | 183 | 183 | 7.86 | 7.86 |
| eqntott | 40 | 40 | 28 | 28 | 6.68 | 6.68 |
| matrix 300 | 78 | 730 | 58 | 6 | 3.43 | 0.37 |
| fpppp | 90 | 87 | 34 | 35 | 2.97 | 3.07 |
| tomcatv | 33 | 138 | 20 | 19 | 2.01 | 1.94 |
| Mean | 54 | 72 | 124 | 108 | 54.42 | 49.99 |
|  | Geometric |  | Arithmetic |  | Weighted Arith. |  |
|  | Ratio | 1.33 | Ratio | 1.16 | Ratio | 1.09 |

## SPEC Performance: Summary

- "For better or worse, benchmarks shape a field"
- Good products created when have:
- Good benchmarks
- Good ways to summarize performance
- Given sales is a function in part of performance relative to competition, investment in improving product as reported by performance summary
- If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales;
Sales almost always wins!
- Execution time is the measure of computer performance!


## Performance Metrics: Summary

- Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?
- Static metrics:
- How many bytes does the program occupy in memory?

Dynamic metrics:

- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?
- Best Metric: Time to execute the program!
- Depends on instructions set, processor organization, and compiler


## Fallacies and Pitfalls

- All exponential laws must come to an end
- Dennard scaling (constant power density)
- Stopped by threshold voltage
- Disk capacity
- 30-100\% per year to 5\% per year
- Moore's Law
- Most visible with DRAM capacity
- ITRS disbanded
- Only four foundries left producing state-of-the-art logic chips
- $11 \mathrm{~nm}, 3 \mathrm{~nm}$ might be the limit


## Fallacies and Pitfalls

- The rated mean time to failure of disks is $1,200,000$ hours or almost 140 years, so disks practically never fail
- MTTF value from manufacturers assume regular replacement
- Peak performance tracks observed performance
- Fault detection can lower availability
- Not all operations are needed for correct execution


## Fallacies and Pitfalls

- Microprocessors are a silver bullet
- Performance is now a programmer's burden
- Falling prey to Amdahl's Law
- A single point of failure
- Hardware enhancements that increase performance also improve energy efficiency, or are at worst energy neutral
- Benchmarks remain valid indefinitely
- Compiler optimizations target benchmarks


## A Relative Performance Example

- If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds,
- Which computer is faster?
- How much faster?
- We know that A is n times faster than B if

- The performance ratio n is $15 / 10=1.5$
- So A is 1.5 times ( $50 \%$ ) faster than B


## Ratios of Measure: Side Note

- For bigger-is-better metrics,
- improved means increase
- $\mathrm{V}_{\text {new }}=2.5 * \mathrm{~V}_{\text {old }}$
- A metric increased by 2.5 times (sometimes written 2.5 x )
- A metric increased by $150 \%$ ( $\mathrm{x} \%$ increase $=0.01 * \mathrm{x}+1$ times increase)
- For smaller-is-better metrics,
- improved means decrease
- e.g., Latency improved by $2 x$, means latency decreased by $2 x$ (i.e., dropped by $50 \%$ )
- e.g., Battery life worsened by $50 \%$, means battery life decrease by $50 \%$.


## Clock Cycle and Clock Rate

- A clock cycle is a single electronic pulse of a CPU
- To synchronize different parts of the circuit
- To determine when events take place in the hardware
- Processor runs at a constant clock rate
- Clock cycle or tick or cycle $=$ Discrete time interval
- Clock rate (frequency)
- Number of clock cycles per second in hertz

- 1 nsec $\left(10^{-9}\right)$ clock cycle $=>1 \mathrm{GHz}\left(10^{9}\right)$ clock rate
- 0.5 nsec clock cycle $=>2 \mathrm{GHz}$ clock rate


## Examples

- Bigger-is-better examples
- Bandwidth per dollar (e.g., in networking (GB/s)/\$)
- BW/Watt (e.g., in memory systems (GB/s)/W)
- Work/Joule (e.g., instructions/joule)
- In general
- Multiply by big-is-better metrics, divide by smaller-is-better metrics
- Smaller-is-better examples
- Cycles/Instruction (i.e., Time per work)
- Latency * Energy -- Energy Delay Product
- In general:
- Multiply by smaller-is-better metrics, divide by bigger-isbetter metrics

CPU Time (Execution Time)

- A program takes $15 \times 10^{10}$ cycles to execute on a computer with a clock cycle time of 500 picosec.
- How many seconds does it take for the program to execute?

CPU Time $=$ Clock Cycles $\times$ Clock Cycle Time
$=\frac{\text { Clock Cycles }}{\text { Clock Rate }}$

- Clock Cycles:
- How many cycles it takes for a program to execute!
- CPU Time (Execution time):
- How many seconds it takes for a program to execute!


## CPU Time Example

- Computer A has a 2 GHz clock rate, executes a program in 10 sec (CPU time)
- Designing Computer B by aiming for 6 sec CPU time
- With a faster clock, but this causes $1.2 \times$ clock cycles
- What is Computer B's clock rate?

$$
\begin{gathered}
\text { Clock Rate }_{\mathrm{B}}=\frac{\text { Clock Cycles }_{\mathrm{B}}}{\text { CPU Time }}=\frac{1.2 \times \text { Clock Cycles }_{\mathrm{B}}}{6 \mathrm{~s}} \\
\text { Clock Cycles }_{\mathrm{A}}=\text { CPU Time } \\
\mathrm{A} \times \operatorname{Clock~Rate}_{\mathrm{A}}=10 \mathrm{~s} \times 2 \mathrm{GHz}=20 \times 10^{9} \\
\text { Clock Rate }_{\mathrm{B}}=\frac{1.2 \times 20 \times 10^{9}}{6 \mathrm{~s}}=\frac{24 \times 10^{9}}{6 \mathrm{~s}}=4 \mathrm{GHz}
\end{gathered}
$$

## Comparing Computers

- Computers A and B implement the same ISA.
- Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program C
- Computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program.
- Which computer is faster and by how much?

Clock Cycles $=$ Instruction Count $\times$ Cycles per Instruction CPU Time $=$ Instruction Count $\times$ CPI $\times$ Clock Cycles Time
$=\underline{\text { Instruction Count } \times \text { CPI }}$
Clock Rate

## Comparing Computers

Clock Cycles $=$ Instruction Count $\times$ Cycles per Instruction CPU Time $=$ Instruction Count $\times$ CPI $\times$ Clock Cycle Time

- Each computer executes the same number of instructions, I, so
CPU time ${ }_{\mathrm{A}}=\mathrm{I} \times 2.0 \times 250 \mathrm{ps}=500 \times \mathrm{Ips}$
CPU time ${ }_{\mathrm{B}}=\mathrm{I} \times 1.2 \times 500 \mathrm{ps}=600 \times \mathrm{Ips}$
- Clearly, A is faster than B by the ratio of execution times

| performance ${ }_{\text {A }}$ | execution_time ${ }_{\text {B }}$ | 600 x Ips |
| :---: | :---: | :---: |
| performance ${ }_{\text {B }}$ | execution_time ${ }_{\text {A }}$ | $500 \times \mathrm{Ips}$ |

