Introduction to Algorithms July 13, 2009

# MIT's Introduction to Algorithms, Lectures 22 and 23: Cache Oblivious Algorithms

<- previous article next article ->

This is a happy and sad moment at the same time - I have finally reached the last two lectures of MIT's undergraduate algorithms course. These last two lectures are on a fairly new area of algorithm research called "cache oblivious algorithms."

Cache-oblivious algorithms take into account something that has been ignored in all the lectures so far, particularly, the multilevel memory hierarchy of modern computers. Retrieving items from various levels of memory and cache make up a dominant factor of running time, so for speed it is crucial to minimize these costs. The main idea of cache-oblivious algorithms is to achieve optimal use of caches on all levels of a memory hierarchy without knowledge of their size.

Cache-oblivious algorithms should not be confused with cache-aware algorithms. Cache-aware algorithms and data structures explicitly depend on various hardware configuration parameters, such as the cache size. Cache-oblivious algorithms do not depend on any hardware parameters. An example of cache-aware (not cache-oblivious) data structure is a B-Tree that has the explicit parameter B, the size of a node. The main disadvantage of cache-aware algorithms is that they are based on the knowledge of the memory structure and size, which makes it difficult to move implementations from one architecture to another. Another problem is that it is very difficult, if not impossible, to adapt some of these algorithms to work with multiple levels in the memory hierarchy. Cache-oblivious algorithms solve both problems.

Lecture twenty-two introduces the terminology and notation used in cache-oblivious algorithms, explains the difference between cache-oblivious and cache-aware algorithms, does a simple memory analysis of several simple algorithms and culminates with a cache-oblivious algorithm for matrix multiplication.

The final lecture twenty-three is the most difficult in the whole course and shows cache-oblivious binary search trees and cache-oblivious sorting called funnel sort.

Use this supplementary reading material by professor Demaine to understand the material better: Cache-oblivious algorithms and data structures (.pdf).

## Lecture 22: Cache Oblivious Algorithms I

Lecture twenty-two starts with an introduction to the modern memory hierarchy (CPU cache L1, L2, L3, main memory, disk cache, etc.) and with the notation and core concepts used in cache-oblivious algorithms.

A powerful result in cache-oblivious algorithm design is that if an algorithm is efficient on two levels of cache, then it's efficient on any number of levels. Thus the study of cache-obliviousness can be simplified to two-level memory hierarchy, say the CPU cache and main memory, where the accesses to cache are instant but are orders of magnitude slower to main memory. Therefore the main question cache-oblivious algorithm analysis tries to address is how many memory transfers (MTs) does a problem of size N take. The notation used for this is MT(N). For an algorithm to be efficient, the number of memory transfers should be as small as possible.

Next the lecture analysis the number of memory transfers for basic array scanning and array reverse algorithms. Since array scanning is consequential, N elements can be processed with O(N/B) accesses, where B is the block size - number of elements that are automatically fetched as N-th element is accessed. That is MT(N) = O(N/B) for array scanning. The same bound holds for reversing an array, since it can be viewed two scans - one from the beginning and one from the end.

Next it's shown that the classical binary search (covered in lecture 3) is not cache efficient, but order statistics problem (covered in lecture 6) is cache efficient.

Finally the lecture describes a cache efficient way to multiply matrices by storing them block-wise in memory.

You're welcome to watch lecture twenty-two:

Topics covered in lecture twenty-two:

• [00:10] Introduction and history of cache-oblivious algorithms.
• [02:00] Modern memory hierarchy in computers: Caches L1, L2, L3, main memory, disk cache.
• [06:15] Formula for calculating the cost to access a block of memory.
• [08:18] Amortized cost to access one element in memory.
• [11:00] Spatial and temporal locality of algorithms.
• [13:45] Two-level memory model.
• [16:30] Notation: total cache size M, block size B, number of blocks M/B.
• [20:40] Notation: MT(N) - number of memory transfers of a problem of size N.
• [21:45] Cache-aware algorithms.
• [22:50] Cache-oblivious algorithms.
• [28:35] Blocking of memory.
• [32:45] Cache-oblivious scanning algorithm (visitor pattern).
• [36:20] Cache-oblivious Array-Reverse algorithm.
• [39:05] Memory transfers in classical binary search algorithm.
• [43:45] Divide and conquer algorithms.
• [45:50] Analysis of memory transfers in order statistics algorithm.
• [01:00:50] Analysis of classical matrix multiplication (with row major, column major memory layout).
• [01:07:30] Cache oblivious matrix multiplication.

Lecture twenty-two notes:

 Lecture 22, page 1 of 2: modern memory hierarchy, spacial and temporal locality, two-level memory model, blocking of memory, basic algorithms: parallel scan. Lecture 22, page 2 of 2: basic algorithms: binary search, divide and conquer algorithms, order statistics, matrix multiplication, block algorithms.

## Lecture 23: Cache Oblivious Algorithms II

This was probably the most complicated lecture in the whole course. The whole lecture is devoted to two subjects - cache-oblivious search trees and cache-oblivious sorting.

While it's relatively easy to understand the design of cache-oblivious way of storing search trees in memory, it's amazingly difficult to understand the cache-efficient sorting. It's called funnel sort which is basically an n-way merge sort (covered in lecture 1) with special cache-oblivious merging function called k-funnel.

You're welcome to watch lecture twenty-three:

Topics covered in lecture twenty-three:

• [01:00] Cache-oblivious static search trees (binary search trees).
• [09:35] Analysis of static search trees.
• [18:15] Cache-aware sorting.
• [19:00] Sorting by repeated insertion in binary tree.
• [21:40] Sorting by binary merge sort.
• [31:20] Sorting by N-way mergesort.
• [36:20] Sorting bound for cache-oblivious sorting algorithms.
• [38:30] Cache-oblivious sorting.
• [41:40] Definition of K-Funnel (cache-oblivious merging).
• [43:35] Funnel sort.
• [54:05] Construction of K-Funnel.
• [01:03:10] How to fill buffer in k-funnel.
• [01:07:30] Analysis of fill buffer.

Lecture twenty-three notes:

 Lecture 23, page 1 of 2: static search trees, cache aware sorting. Lecture 23, page 2 of 2: cache oblivious sorting, k-funnels, funnel sort, fill-buffer algorithm and analysis.

Have fun with the cache oblivious algorithms! I'll do a few more posts that will summarize all these lectures and highlight key ideas.

If you loved this, please subscribe to my blog!

<- previous article next article ->

Is there any lecture on complexity analysis of algorithms ?

I'm curious... Why not have the application query the hardware to find out its capabilities, and then tweak the cache-aware algorithms accordingly?

@Wolter: I can think of two reasons.

1. The immediately practical reason: You'd need to recompile the code to achieve best performance (treating the cache size as a constant and not something that varies at run time). You could do this though and there might even exist a sufficiently advanced JIT somewhere.

2. The real reason: Because you can write a cache oblivious algorithm and have it perform best all the time without having to worry about that. The cache-aware-with-adjusting actually requires more complexity. The cache oblivious way "just works."

> You’d need to recompile the code to achieve best performance (treating the cache size as a constant and not something that varies at run time).

But how significant would the performance impact be? I'm wondering whether a dynamic cache-aware algorithm would give better performance than a cache-oblivious one. It would seem reasonable that a cache-aware algorithm would be capable of making more intelligent decisions.

> Because you can write a cache oblivious algorithm and have it perform best all the time without having to worry about that. The cache-aware-with-adjusting actually requires more complexity. The cache oblivious way “just works.”

But is that in fact true? Do cache-oblivious algorithms actually outperform cache-aware algorithms? Or are they just a compromise?

Fairly advanced stuff... I am fascinated by Open Course Ware.

Are there any lectures covering branch and bound algorithms?

On the first page of your lecture notes for Lecture 23, in the analysis of static search trees, you have that the triangle (subtree) ≤ B is ≥ than log B. I think it should be the triangle ≤ B is ≥ than 1/2 log B. But other than that, thank you again for posting these, they really help when combined with the blackboard notes.

Sapping, oh yes, 1/2 log B. I somehow missed the 1/2 fraction. Thanks for spotting that!

It looks like the videos have been made private. Can you please put it back to the public domain? Thanks!

where are the videos？