## **Computer Architecture**

Prof. Dr. Nizamettin AYDIN

naydin@yildiz.edu.tr nizamettinaydin@gmail.com

http://www.yildiz.edu.tr/~naydin

# Memory Hierarchy

#### Outline

- Introduction
- Characteristics
- Memory hierarchy
- Cache
- Mapping function
- Replacement algorithms
- Write policy
- Examples



#### Introduction

- Memory lies at the heart of the storedprogram computer.
- In this lecture, we focus on memory organization and architecture.
- A clear understanding of these ideas is essential for the analysis of system performance.



#### Characteristics

- Location
- Capacity
- Unit of transfer
- · Access method
- Performance
- · Physical type
- Physical characteristics
- Organisation

#### Characteristics

- Location
  - CPU is the reference
  - Internal
  - External
- Capacity
  - Word sizeThe natural unit of organisation
  - Number of words
    - or Bytes

#### **Characteristics**

#### Unit of Transfer

- Internal
  - Usually governed by data bus width
- External
  - Usually a block which is much larger than a word
- Addressable unit
  - Smallest location which can be uniquely addressed
  - Word internally
  - · Cluster on disks

#### Characteristics

#### Access Methods...

- Sequential
  - Start at the beginning and read through in order
  - Access time depends on location of data and previous location
  - e.g. tape

#### - Direct

- · Individual blocks have unique address
- · Access is by jumping to vicinity plus sequential search
- Access time depends on location and previous location – e.g. disk

## Characteristics

#### ...Access Methods

- Random
  - Individual addresses identify locations exactly
  - Access time is independent of location or previous access e.g. RAM
- Associative
  - Data is located by a comparison with contents of a portion of the store
  - Access time is independent of location or previous access e.g. cache

## Characteristics

#### • Performance

- Access time
  - Time between presenting the address and getting the valid data
- Memory Cycle time
  - Time may be required for the memory to recover before next access
  - Cycle time is access + recovery
- Transfer Rate
  - Rate at which data can be moved





35 ns

4 Gb

2012



Characteristics

- It is possible to build a computer which uses only static

• The Bottom Line

- How much?

- How fast?

Capacity

• Time is money

· This would be very fast

· This would need no cache

· This would cost a very large amount

- How expensive?

· So you want fast?

RAM



Copyright 2000 N. AYDIN. All rights reserved.

## 3

#### **Memory Hierarchy**

- Why have memory hierarchy?
  - We want both fast and large memory
  - But we cannot achieve both with a single level of memory
- Idea
  - to have multiple levels of storage
    - progressively bigger and slower as the levels are farther from the processor
  - to ensure most of the data the processor needs is kept in the fast(er) level(s)

## **Memory Hierarchy**

- Registers
- Internal or Main memory

   May include one or more levels of cache
   "RAM"
- External memory

   Backing store
- This storage organization can be thought of as a pyramid







Copyright 2000 N. AYDIN. All rights reserved.













Copyright 2000 N. AYDIN. All rights reserved.

#### Cache

- The purpose of cache memory
   - to speed up accesses by storing recently used
   data closer to the CPU
- · It is much smaller than main memory
- Its access time is a fraction of that of main memory.
- main memory is accessed by address,
- cache is typically accessed by content;
   hence, it is often called content addressable memory
  - Because of this, a single large cache memory isn't always desirable
    - it takes longer to search

#### Cache

- The content that is addressed in content addressable cache memory is a subset of the bits of a main memory address called a field.
  - The fields into which a memory address is divided provide a many-to-one mapping between larger main memory and the smaller cache memory.
    - Many blocks of main memory map to a single block of cache.
    - A tag field in the cache block distinguishes one cached memory block from another.









#### **Elements of Cache Design**

- Cache Addresses
  - Logical
  - Physical
- Cache Size
- Mapping Function
  - Direct
  - Associative
  - Set associative
- Replacement Algorithm
  - Least recently used
    - First in first out

- Least frequently
   used
- Random
- Write Policy
  - Write through
  - Write back
- Block Size
- Number of Caches
  - Single or two level
  - Unified or split
- Unified
- out

#### **Cache Addresses**

- Almost all nonembedded processors, and many embedded processors, support virtual Memory
  - a facility that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available.
- When virtual memory is used, the address fields of machine instructions contain virtual addresses.
- For reads to and writes from main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory.

#### **Cache Addresses**

- When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory.
- A logical cache, also known as a virtual cache, stores data using virtual addresses.
  - The processor accesses the cache directly, without going through the MMU.
- A physical cache stores data using main memory physical addresses.



## Cache Size does matter

- We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone and large enough so that the overall average access time is close to that of the cache alone.
- Cost
  - More cache is expensive
- Speed
  - More cache is faster (up to a point)
  - Checking cache for data takes time

#### **Comparison of Cache Sizes**

| Processor       | Type                              | Year of Introduction | L1 cachea     | L2 cache       | L3 cache |
|-----------------|-----------------------------------|----------------------|---------------|----------------|----------|
| IBM 360/85      | Mainframe                         | 1968                 | 16 to 32 KB   | -              | -        |
| PDP-11/70       | Minicomputer                      | 1975                 | 1 KB          | -              |          |
| VAX 11/780      | Minicomputer                      | 1978                 | 16 KB         | -              | _        |
| IBM 3033        | Mainframe                         | 1978                 | 64 KB         | -              | _        |
| IBM 3090        | Mainframe                         | 1985                 | 128 to 256 KB | -              | _        |
| Intel 80486     | PC                                | 1989                 | 8 KB          | -              | -        |
| Pentium         | PC                                | 1993                 | 8 KB/8 KB     | 256 to 512 KB  |          |
| PowerPC 601     | PC                                | 1993                 | 32 KB         | -              |          |
| PowerPC 620     | PC                                | 1996                 | 32 KB/32 KB   | -              | -        |
| PowerPC G4      | PC/server                         | 1999                 | 32 KB/32 KB   | 256 KB to 1 MB | 2 MB     |
| IBM S/390 G4    | Mainframe                         | 1997                 | 32 KB         | 256 KB         | 2 MB     |
| IBM S/390 G6    | Mainframe                         | 1999                 | 256 KB        | 8 MB           | -        |
| Pentrum 4       | PC/server                         | 2000                 | 8 KB/8 KB     | 256 KB         | -        |
| IBM SP          | High-end server/<br>supercomputer | 2000                 | 64 KB/32 KB   | 8 MB           | _        |
| CRAY MTAb       | Supercomputer                     | 2000                 | 8 KB          | 2 MB           | -        |
| Itanium         | PC/server                         | 2001                 | 16 KB/16 KB   | 96 KB          | 4 MB     |
| SGI Origin 2001 | High-end server                   | 2001                 | 32 KB/32 KB   | 4 MB           | -        |
| Itanium 2       | PC/server                         | 2002                 | 32 KB         | 256 KB         | 6 MB     |
| IBM POWER5      | High-end server                   | 2003                 | 64 KB         | 1.9 MB         | 36 MB    |
| CRAY XD-1       | Supercomputer                     | 2004                 | 64 KB/64 KB   | 1MB            | _        |

#### **Mapping Function**

- Because there are fewer lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines.
- Which main memory block currently occupies a cache line?
- Three techniques can be used:
  - Direct mapping
  - Associative mapping
  - Set associative mapping

#### **Example System for Mapping Functions**

- For all three cases, the example includes the following elements:
- Cache can hold 64 kBytes
- Data are transferred between main memory and the cache in blocks of 4 bytes
  - i.e. cache is  $16k = 2^{14}$  lines of 4 bytes each
- Main memory consists of 16 MBytes - 16 M = 2<sup>24</sup>, each byte directly adressable by 24 bit address
  - So, we consider main memory to consist of 4 M blocks of 4 bytes each







# **Direct Mapping Cache Line Table**

• The effect of this mapping is that blocks of main memory are assigned to lines of the cache as follows:

| Cache line | Main Memory blocks held |
|------------|-------------------------|
| 0          | 0, m, 2m, 3m2s-m        |
| 1          | 1,m+1, 2m+12s-m+1       |
|            |                         |
| •          |                         |
|            |                         |
| m-1        | m-1, 2m-1,3m-12s-1      |





#### **Direct Mapping Summary**

- Address length = (s + w) bits
- Number of addressable units = 2<sup>s+w</sup> words or bytes
- Block size = line size =  $2^{w}$  words or bytes
- Number of blocks in main memory  $= 2^{s+w/2w} = 2^s$
- Number of lines in cache =  $m = 2^r$
- Size of tag = (s r) bits

## **Direct Mapping pros & cons**

- Simple
- Inexpensive
- Fixed location for given block
  - If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high

## **Associative Mapping**

- Instead of placing memory blocks in specific cache locations based on memory address,
- we could allow a block to go anywhere in cache.
- In this way, cache would have to fill up before any blocks are evicted.
- This is how fully associative cache works.
- A memory address is partitioned into only two fields: the tag and the word.



Cache searching gets expensive







#### **Associative Mapping Summary**

- Address length = (s + w) bits
- Number of addressable units = 2<sup>s+w</sup> words or bytes
- Block size = line size =  $2^{w}$  words or bytes
- Number of blocks in main memory  $= 2^{s+w/2^w} = 2^s$
- Number of lines in cache = undetermined
- Size of tag = s bits





Copyright 2000 N. AYDIN. All rights reserved.



## Set Associative Mapping Example

- 13 bit set number
- Block number in main memory is modulo 2<sup>13</sup>
- 000000, 00A000, 00B000, 00C000 ...
   map to same set



| Tag 9 bits       | Set 13 bits  |             | Word<br>2 bits |
|------------------|--------------|-------------|----------------|
| Use set field    | to determin  | e cache set | t to look in   |
|                  |              |             |                |
| Compare tag      | field to see | if we have  | e a hit        |
| – e.g            |              |             |                |
| - e.g<br>Address | Tag          | Data        | Set number     |
| – e.g            |              | Data        | Set number     |





#### **Replacement Algorithms**

- Direct mapping
  - Each block only maps to one line
  - Replace that line
- Associative & Set Associative mapping • These are hardware implemented algorithms (speed)
  - Least Recently used (LRU)
  - First in first out (FIFO)
  - · replace block that has been in cache longest
  - Least frequently used
    - replace block which has had fewest hits
  - Random

#### Write Policy

- Must not overwrite a cache block unless main memory is up . to date
- Multiple CPUs may have individual caches
- · I/O may address main memory directly
- Two techniques

#### Write through

- All writes go to main memory as well as cache
- Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache un to date
- Lots of traffic
- · Slows down writes

#### Write back

- · Updates initially made in cache only Undate bit for cache slot is set when undate occurs
- If block is to be replaced, write to main memory only if update bit is set
- · Other caches get out of sync
- I/O must access main memory through cache

#### **Pentium 4 Cache**

- 80386 no on chip cache
- 80486 8k using 16 byte lines and four way set associative . organization
- Pentium (all versions) two on chip L1 caches Data & instruction
- Pentium III L3 cache added off chip
- Pentium 4

•

- L1 caches
- 8k bytes
  - 64 byte lines
- · four way set associative L2 cache
- · Feeding both L1 caches
- 256k
- 128 byte lines
- 8 way set associative - L3 cache on chip

## **Intel Cache Evolution**

| Problem                                                                                                                                                                                                                        | Solution                                                                                                                                        | Processor on which feature<br>first appears |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------|
| External memory slower than the system bus.                                                                                                                                                                                    | Add external cache using faster<br>memory technology.                                                                                           | 386                                         |
| Increased processor speed results in external bus<br>becoming a bottleneck for cache access.                                                                                                                                   | Move external cache on-chip,<br>operating at the same speed as the<br>processor.                                                                | 486                                         |
| Internal cache is rather small, due to limited space on<br>chip                                                                                                                                                                | Add external L2 cache using faster<br>technology than main memory                                                                               | 486                                         |
| Contention occurs when both the Instruction Prefetcher<br>and the Execution Unit simultaneously require access to<br>the cache. In that case, the Prefetcher is stalled while the<br>Execution Unit's data access takes place. | Create separate data and instruction caches.                                                                                                    | Pentium                                     |
| Increased processor speed results in external bus<br>becoming a bottleneck for L2 cache access.                                                                                                                                | Create separate back-side bus that<br>runs at higher speed than the main<br>(front-side) external bus. The BSB<br>is dedicated to the L2 cache. | Pentium Pro                                 |
|                                                                                                                                                                                                                                | Move L2 cache on to the<br>processor chip.                                                                                                      | Pentium II                                  |
| Some applications deal with massive databases and must<br>have rapid access to large amounts of data. The on-chip<br>caches are too small                                                                                      | Add external L3 cache.                                                                                                                          | Pentium III                                 |
| caciles are too siliali.                                                                                                                                                                                                       | Move L3 cache on-chip.                                                                                                                          | Pentium 4                                   |





Copyright 2000 N. AYDIN. All rights reserved.

## Pentium 4 Design Reasoning

- · Decodes instructions into RISC like micro-ops before L1 cache
- Micro-ops fixed length
- Superscalar pipelining and scheduling
- Pentium instructions long & complex
- Performance improved by separating decoding from scheduling & pipelining
   (More later ch14)
- Data cache is write back
- Can be configured to write through
- L1 cache controlled by 2 bits in register
   CD = cache disable
  - OB = cache disable
     NW = not write through
  - 2 instructions to invalidate (flush) cache and write back then invalidate
- L2 and L3 8-way set-associative
  - Line size 128 bytes

## **PowerPC Cache Organization**

- 601 single 32kb 8 way set associative
- 603 16kb (2 x 8kb) two way set associative
- 604 32kb
- 620 64kb
- G3 & G4
- 64kb L1 cache
  - 8 way set associative
  - 256k, 512k or 1M L2 cache
  - two way set associative
- G5
  - 32kB instruction cache
  - 64kB data cache



#### **Internet Sources**

- Manufacturer sites
  - Intel
  - IBM
  - Motorola
- · Search on cache

