Follow me on Twitter for my latest adventures!
MIT's Introduction to Algorithms, Lectures 20 and 21: Parallel Algorithms
This is the thirteenth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures twenty and twentyone on parallel algorithms. These lectures cover the basics of multithreaded programming and multithreaded algorithms.
Lecture twenty begins with a good overview of multithreaded programming paradigm, introduces to various concepts of parallel programming and at the end talks about the Cilk programming language.
Lecture twentyone implements several multithreaded algorithms, such as nxn matrix addition, nxn matrix multiplication, and parallel mergesort. It then goes into great detail to analyze the running time and parallelism of these algorithms.
Lecture 20: Parallel Algorithms I
The lecture starts with an introduction to a particular model of parallel computation called dynamic multithreading. It's easiest to understand this model with an example. Here is a multithreaded algorithm that computes Fibonacci numbers:
Fibonacci(n): 1 if n < 2:  thread A 2 return n  thread A 3 x = spawn Fibonacci(n1)  thread A 4 y = spawn Fibonacci(n2)  thread B 5 sync  thread C 6 return x + y  thread C # this is a really bad algorithm, because it's # exponential time. # # see lecture three for four different # classical algorithms that compute Fibonacci numbers
(A thread is defined to be a maximal sequence of instructions not containing the parallel control statements spawn, sync, and return, not something like Java threads or Posix threads.)
I put the most important keywords in bold. They are "spawn" and "sync".
The keyword spawn is the parallel analog of an ordinary subroutine call. Spawn before the subroutine call in line 3 indicates that the subprocedure Fibonacci(n1) can execute in parallel with the procedure Fibonacci(n) itself. Unlike an ordinary function call, where the parent is not resumed until after its child returns, in the case of a spawn, the parent can continue to execute in parallel with the child. In this case, the parent goes on to spawn Fibonacci(n2). In general, the parent can continue to spawn off children, producing a high degree of parallelism.
A procedure cannot safely use the return values of the children it has spawned until it executes a sync statement. If any of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. When all of its children return, execution of the procedure resumes at the point immediately following the sync statement. In the Fibonacci example, the sync statement in line 5 is required before the return statement in line 6 to avoid the anomaly that would occur if x and y were summed before each had been computed.
The spawn and sync keywords specify logical parallelism, not "actual" parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by a scheduler, which maps the dynamically unfolding computation onto the available processors.
The lecture then continues with a discussion on how a parallel instruction stream can be viewed as a directed acyclic graph G = (V, E), where threads make up the set V of vertices, and each spawn and return statement makes up an edge in E. There are also continuation edges in E that exist between statements.
In Fibonacci example there are three threads: lines 1, 2, 3 make up the thread A, line 4 is thread B and lines 5, 6 are thread C.
Here is how a graph of Fibonacci(4) computation looks like:
A dag representing multithreaded computation of Fibonacci(4).
Here the threads are show as circles, and each group of threads belonging to the same procedure are surrounded by a rounded rectangle. Downward edges are spawn dependencies, horizontal edges represent continuation dependencies within a procedure, and upward edges are return dependencies. Every computation starts with a single initial thread, and ends with a single final thread.
So to compute Fib(4), first Fib(3) and Fib(2) have to be computed. Fib(4) spawns one thread to compute Fib(3) and another thread to compute Fib(2) and sync'es. In order to compute Fib(3), values of Fib(2) and Fib(1) have to be known, and to compute Fib(2) values of Fib(1) and Fib(0) have to be known. Fib(3) and Fib(2) spawn their threads and sync. This way the computation unwinds.
Next, the lecture introduces some concepts for measuring performance of multithreaded algorithms.
First some time measures:
 T_{p}  running time of an algorithm on p processors,
 T_{1}  running time of algorithm on 1 processor, and
 T_{∞} as critical path length, that is the longest time to execute the algorithm on infinite number of processors.
Based on these measures, the speedup of a computation on p processors is T_{1}/T_{p}, that is how much faster a p processor machine executes the algorithm than a one processor machine. If the speedup is O(p), then we say that the computation has a linear speedup. If, however, speedup > p, then we call it a superlinear speedup. The maximum possible speedup is called parallelism and is computed by T_{1}/T_{∞}.
After these concepts the lecture puts forward the concept of scheduling and greedy scheduler.
A scheduler maps computation to p processors.
The greedy scheduler schedules as much as it can at every time step. On a pprocessor machine every step can be classified into two types. If there are p or more threads ready to execute, the scheduler executes any p of them (this step is also called a complete step). If there are fewer than p threads, it executes all of them (called incomplete step). The greedy scheduler is an offline scheduler and it's 2competitive (see lecture 13 on online/offline algorithms and the meaning of competitive algorithm).
The lecture ends with an introduction to Cilk parallel, multithreaded programming language. It uses a randomized online scheduler. It was used to write two chess programs *Socrates and Cilkchess.
See Charles Leiseron's handout for a much more comprehensive overview of dynamic multithreading: A Minicourse on Dynamic Multithreaded Algorithms (pdf).
You're welcome to watch lecture twenty:
Topics covered in lecture twenty:
 [01:33] Parallel algorithms.
 [02:33] Dynamic multithreading.
 [03:15] A multithreaded algorithm for fibonnaci numbers.
 [04:50] Explanation of spawn and sync.
 [06:45] Logical parallelism.
 [08:00] Multithreaded computation.
 [12:15] Fibonacci(4) computation graph, and its concepts  init thread, spawn edge, continuation edge, return edge, final thread.
 [17:45] Edges: spawn, return, continuation.
 [19:40] Performance measures: running time on p processors T_{p}, work (serial time) T_{1}, critcial path length (longest path) T_{infinity}.
 [24:15] Lower bounds on T_{p} (running time on p processors).
 [29:35] Speedup, linear speedup, superlinear speedup.
 [32:55] Maximum possible speedup.
 [36:40] Scheduling.
 [38:55] Greedy scheduler.
 [43:40] Theorem by Grant and Brent (bound on T_{p} for greedy scheduler).
 [46:20] Proof of Grant, Brent theorem.
 [56:50] Corollary on greedy scheduler's linear speedup.
 [01:02:20] Cilk programming language.
 [01:06:00] Chess programs: *Socrates, Clikchess.
 [01:06:30] Analysis of *Socrates running time on 512 processors.
Lecture twenty notes:
Lecture 21: Parallel Algorithms II
Lecture twentyone is all about real multithreaded algorithms and their analysis.
It starts with a hugly important problem of multithreaded matrix multiplication: given two nxn matrices A and B, write a multithreaded algorithm MatrixMultiply that multiplies them, producing matrix C. The natural approach is to use divide and conquer method from lecture three, and formulate the problem as follows:
Each nxn matrix can be expressed as 8 multiplications and 4 additions of (n/2)x(n/2) submatrices. So the first thing we need is MatrixAdd algorithm that adds two matrices in parallel.
Here is the parallel MatrixAdd algorithm:
MatrixAdd(C, T, n): // Adds matrices C and T inplace, producing C = C + T // n is power of 2 (for simplicity). if n == 1: C[1, 1] = C[1, 1] + T[1, 1] else: partition C and T into (n/2)x(n/2) submatrices spawn MatrixAdd(C_{11}, T_{11}, n/2) spawn MatrixAdd(C_{12}, T_{12}, n/2) spawn MatrixAdd(C_{21}, T_{21}, n/2) spawn MatrixAdd(C_{22}, T_{22}, n/2) sync
The base case for this algorithm is when matrices are just scalars, then it just adds the only element of both matrices. In all other cases, it partitions matrices C and T into (n/2)x(n/2) matrices and spawns four subproblems.
Now we can implement the MatrixMultiply algorithm, by doing 8 multiplications in parallel and then calling MatrixAdd to do 4 additions.
Here is the parallel MatrixMultiply algorithm:
MatrixMultiply(C, A, B, n): // Multiplies matrices A and B, storing the result in C. // n is power of 2 (for simplicity). if n == 1: C[1, 1] = A[1, 1] · B[1, 1] else: allocate a temporary matrix T[1...n, 1...n] partition A, B, C, and T into (n/2)x(n/2) submatrices spawn MatrixMultiply(C_{12},A_{11},B_{12}, n/2) spawn MatrixMultiply(C_{21},A_{21},B_{11}, n/2) spawn MatrixMultiply(C_{22},A_{21},B_{12}, n/2) spawn MatrixMultiply(T_{11},A_{12},B_{21}, n/2) spawn MatrixMultiply(T_{12},A_{12},B_{22}, n/2) spawn MatrixMultiply(T_{21},A_{22},B_{21}, n/2) spawn MatrixMultiply(T_{22},A_{22},B_{22}, n/2) sync MatrixAdd(C, T, n)
The lecture then proceeds to calculating how good the algorithm is, turns out the parallelism (ratio of time of algorithm running on a single processor T_{1} to running on infinite number of processors T_{inf}) for doing 1000x1000 matrix multiplication is 10^{7}, which the parallel algorithm is fast as hell.
MatrixMultiply used a temporary matrix which was actually unnecessary. To achieve higher performance, it's often advantageous for an algorithm to use less space, because more space means more time. Therefore a faster algorithm can be added that multiplies matrices and adds them inplace. See lecture at 33:05 for MatrixMultiplyAdd algorithm. The parallelism of this algorithm for 1000x1000 matrix multiplication is 10^{6}, smaller than the previous one, but due to fact that no temporary matrices had to be used, it beats MatrixMultiply algorithm in practice.
The lecture then proceeds to parallel sorting. The classical sorting algorithms were covered in lectures four and five. This lecture parallelizes the MergeSort algorithm.
Here is the parallel mergesort algorithm:
ParallelMergeSort(A, p, r): // sort A[p...r] if p < r: q = floor((p+r)/2) spawn ParallelMergeSort(A, p, q) spawn ParallelMergeSort(A, q+1, r) sync Merge(A, p, q, r) // merge A[p...q] with A[q+1...r]
Instead of recursively calling ParallelMergeSort for subproblems, we spawn them, thus utilizing the power of parallelism. After doing analysis turns out that due to fact that we use the classical Merge() function, the parallelism is just lg(n). That's puny!
To solve this problem, a ParallelMerge algorithm has to be written. The lecture continues with presenting a ParallelMerge algorithm, and after that does the analysis of the same ParallelMergeSort algorithm but with ParallelMerge function. Turns out this algorithm has parallelism of n/lg^{2}(n), that is much better. See lecture at 51:00 for ParallelMerge algorithm.
The best parallel sorting algorithm know to date has the parallelism of n/lg(n).
You're welcome to watch lecture twentyone:
Topics covered in lecture twentyone:
 [00:35] Multithreaded algorithms.
 [01:32] Multithreaded matrix multiplication.
 [02:05] Parallelizing with divide and conquer.
 [04:30] Algorithm MatrixMultiply, that computes C = A*B.
 [10:13] Algorithm MatrixAdd, that computes C = C+T.
 [12:45] Analysis of running time of MatrixMultiply and MatrixAdd.
 [19:40] Analysis of critical path length of MatrixMultiply and MatrixAdd.
 [26:10] Parallelism of MatrixMultiply and MatrixAdd.
 [33:05] Algorithm MatrixMultiplyAdd, that computes C = C + A*B.
 [35:50] Analysis of MatrixMultiplyAdd.
 [42:30] Multithreaded, parallel sorting.
 [43:30] Parallelized MergeSort algorithm.
 [45:55] Analysis of multithreaded MergeSort.
 [51:00] ParallelMerge algorithm.
 [01:00:30] Analysis of ParallelMerge algorithm.
Lecture twentyone notes:


Have fun with parallel, multithreaded algorithms! The next post is going to be on cache oblivious algorithms! They are a class of algorithms that exploit all the various caches in modern computer architecture to make them run notably faster.
Comments
Interesting post and an informative writeup. Thanks!
I just want you to know how much I appreciate the amount of effort you've put into this. In particular, I like the way you made your notes and put them up. I'm currently going through the old MIT SICP videos myself.
Excellent series, I've already watched all of the previous parts.
Thanks and keep up!
very nice post made me refresh and it all came back to me now
excelent work
keep up, it's very usefull!
Thanks buddy! It was very helpful...
Parallel MatrixMultiply algorithm misses line:
spawn MatrixMultiply(C11,A11,B11, n/2)
indeed its a wonderful site for informatiion technology addicts.
In the handout (lecture 21, page 2 of 2), the critical path length for mergesort with plain merge (not parallel merge) was given as T_inf(n) = T_inf(n/2) + Theta(1) = Theta(n).
Shouldn't this be T_inf(n) = T_inf(n/2) + Theta(n) = Theta(n)?
Checked the video. The notes were wrong. The video was right with
T_inf(n) = T_inf(n/2) + Theta(n) = Theta(n).
Otherwise the result would have been Theta(log n).
The notesu've prepared n lecture was so nice.
Thanks for your efforts.
Actually I'm preparing for yahoo/microsoft/google interview.
It is very helpful.
Thanks again.
Request all to send related interview questions if possible..
Nice post buddy... came here after searching for lectures 20,21.
BTW could you reupload the lectures ?
(They seem to be missing and the MIT OCW site doesn't have them either)
Leave a new comment