class: center, middle, title-slide-cs49365

## CSci 493.65/795.24 Parallel Computing
## Chapter 4: Parallel Programming in Open MPI

.author[
Stewart Weiss
]

.license[
Copyright 2021-24 Stewart Weiss. Unless noted otherwise, all content is released under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
Background image: George Washington Bridge at Dusk, by Stewart Weiss.
]

---
name: cc-notice
template: default
layout: true

.bottom-left[© Stewart Weiss. CC-BY-SA.]

---
name: tinted-slide
template: cc-notice
layout: true
class: tinted

---
name: contents

### Preview

We introduce the following parts of MPI in this chapter.

| MPI Function    | Purpose                                            |
|:---             |:---                                                |
| `MPI_Init`      | initializes MPI                                    |
| `MPI_Comm_rank` | determines a process's ID number (called its rank) |
| `MPI_Comm_size` | determines the number of processes                 |
| `MPI_Reduce`    | performs a reduction operation                     |
| `MPI_Finalize`  | shuts down MPI and releases resources              |
| `MPI_Bcast`     | performs a broadcast operation                     |
| `MPI_Barrier`   | performs a barrier synchronization operation       |
| `MPI_Wtime`     | determines the wall time                           |
| `MPI_Wtick`     | determines the length of a clock tick              |

---
name: mpi-model

### The Message Passing Model

MPI's underlying hardware model:

- It acts as if there are independent processors, each with its own local memory, connected by an interconnection network in which every processor is directly connected to every other processor.

.center[
]

---
name: mpi-paradigm

### The Program Paradigm

- In MPI 1.0, new processes could not be created dynamically; in MPI 2 it became possible.
- MPI uses the .redbold[Single Program Multiple Data] (.redbold[SPMD]) model: all processes execute the same program, but each has a __unique identification number__.
- To make processes do different things, they can branch on decisions based on their identification number:

```C
if ( my_id == n )
    f();
else
    g();
```

---
name: blocking

### Blocking and Non-Blocking Communication

- Some message-passing primitives are .redbold[blocking] and some are .redbold[non-blocking].
- A communication function is .redbold[blocking] if the executing process cannot continue execution until the called communication function has completed. It is .redbold[non-blocking] if the process can proceed even if the function has not completed its operation (such as delivering a message to the destination or receiving a message from a source).
- Blocking primitives are also called .redbold[synchronous] primitives; non-blocking ones are called .redbold[asynchronous] primitives.

---
name: synchronizing

### Message-Passing for Synchronization

- Messages are used not just to exchange data between processes, but to .redbold[synchronize] as well.
- When a process calls a blocking `receive`, it does not return from the call until some other process has sent a message to it.
- A message can be empty.
- Two processes can synchronize their execution by exchanging empty messages, each using a blocking receive. (A sketch appears on the Testing and Debugging slide below.)
- MPI also has special synchronization operations.

.redbold[It is best to limit the use of synchronization as much as possible] because it reduces the amount of parallel activity.

---
name: overview-2

### About Message-Passing

- Accessing data from another process is slower than retrieving it from local memory, because passing a message incurs communication overhead that a local access does not.
- Local data accesses tend to result in high cache hit rates because of spatial locality, which improves performance.
- Because communication costs reduce performance, good programmers design their programs to make as many accesses local as possible.

---
name: debugging

### Testing and Debugging

The biggest downside to writing parallel programs is that testing and debugging them is much, much harder than testing and debugging sequential programs.

- Repeated runs of the same program do not necessarily produce the exact same results or the same sequence of operations in time.
- Trying to force programs to repeat what they did requires modifying code in such a way that bugs might even be hidden.
- Message-passing programs are much easier to test and debug than those using a shared-memory model, because the message-passing interface, when used properly, prevents .redbold[race conditions].
- A .redbold[race condition] exists in a system of two or more processes when the state of the system depends on the order in which these processes execute a sequence of instructions.
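For example, message passing can impose an order on two processes' actions. Below is a minimal sketch (not one of the chapter's programs) in which rank 1 blocks in `MPI_Recv` until rank 0's empty message arrives, so rank 1's step is guaranteed to happen after rank 0's, and there is no race between them:

```C
#include "mpi.h"
#include <stdio.h>

int main( int argc, char* argv[] )
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if ( rank == 0 ) {
        printf("rank 0: first step\n");
        /* empty message: count 0, no buffer needed */
        MPI_Send(NULL, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    }
    else if ( rank == 1 ) {
        /* blocks until rank 0's message arrives */
        MPI_Recv(NULL, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: runs strictly after rank 0's first step\n");
    }
    MPI_Finalize();
    return 0;
}
```

The non-blocking counterparts, `MPI_Isend` and `MPI_Irecv`, return immediately and are paired with a later `MPI_Wait`; they provide no such ordering until the wait completes.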
---
name: tips

### Testing and Debugging Advice

- Use lots of print statements when debugging to see what's what
- Always run with just a few processes when debugging
- Learn to use the GDB debugger
- Run on a single host to start, then on two

---
name: terminology

### MPI Terminology

Some terms to learn:

- .redbold[Communicator]: An object that makes it possible for processes to communicate with each other using MPI message-passing primitives
- .redbold[Handle]: The unique ID of a communicator
- .redbold[Rank]: The unique non-negative ID of a process within the communicator, 0-based
- .redbold[Buffer]: An array or string that is transferred to one or more processes using a message-passing operation
- .redbold[Core]: An individual computing element
- .redbold[Node]: A collection of computing elements that
  - share the same network address,
  - share memory, and
  - are typically on the same motherboard
- .redbold[Hostfile]: The list of hosts on which an MPI executable is scheduled to run

---
name: first-prog

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */
int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    /* retrieve running task's rank (id) into rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Number of tasks = %d My rank = %d\n", numtasks, rank);

    /* Before quitting, notify MPI, so it can do cleanup */
    MPI_Finalize();
    return 0;
}
```
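To try it, compile with Open MPI's C compiler wrapper and launch with `mpirun` (the file name `hello.c` is just a choice for this example). Each process prints one line; the order of the lines varies from run to run:

```bash
mpicc -o hello hello.c   # mpicc is Open MPI's compiler wrapper for C
mpirun -np 4 ./hello     # run the program as 4 processes
```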
---
name: first-prog-comments

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    /* This expects the addresses of the argc and argv parameters, so you */
    /* need the address-of operator here. It initializes MPI and makes    */
    /* argc and argv available to it.                                     */
```
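For reference, the prototype; the return value is `MPI_SUCCESS` when initialization succeeds:

```C
int MPI_Init( int *argc, char ***argv );
```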
---
name: first-prog-comments-2

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        `MPI_Abort(MPI_COMM_WORLD, rc)`;
        /* all tasks in MPI_COMM_WORLD are terminated, if possible */
    }
```
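For reference, the prototype; the call makes a best attempt to terminate all tasks in the communicator:

```C
int MPI_Abort( MPI_Comm comm, int errorcode );
```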
---
name: first-prog-3

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    `MPI_Comm_size(MPI_COMM_WORLD, &numtasks);`
    /* You almost always need to know inside a program how many tasks are */
    /* in the communicator, i.e., MPI_COMM_WORLD. This is how you get     */
    /* that number and store it into the second parameter.                */
```
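For reference, the prototypes of the two query functions used in this program:

```C
int MPI_Comm_size( MPI_Comm comm, int *size );  /* number of processes     */
int MPI_Comm_rank( MPI_Comm comm, int *rank );  /* calling process's rank  */
```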
---
name: first-prog-4

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    /* retrieve running task's rank (id) into rank */
    `MPI_Comm_rank(MPI_COMM_WORLD, &rank);`
    /* This is how a process gets its ID, called its rank.                 */
    /* Almost every MPI program will need every process to know its rank,  */
    /* which it can use to do task-specific computation.                   */
}
```

---
name: sat

### Solving A Problem: Circuit Satisfiability

We create our first MPI program to solve a relatively simple problem.

A .redbold[decision problem] is a problem that has a yes/no answer. The .redbold[circuit satisfiability problem] is the decision problem of determining whether a given Boolean circuit has an assignment of its inputs that makes the output true.

- The circuit may contain any number of gates.
- If there exists an assignment of values to the inputs that makes the output of the circuit true, it is called a .redbold[satisfiable circuit]; otherwise it is called .redbold[unsatisfiable].
.left-column[
.center[
A satisfiable circuit with 4 inputs]
]
.right-column[
.center[
An unsatisfiable circuit with 2 inputs]
]
.below-column[
__The Circuit Satisfiability Problem: Given a circuit, is it satisfiable?__
]

---
name: sat-2

### The Problem

An obvious sequential solution is to try all possible inputs and stop as soon as one is found that outputs true:

```C
/* for an n-bit circuit */
Let N = 2^n
for ( int i = 0; i < N; i++ )
    if the bits of i satisfy the circuit
        output "satisfiable" and stop
```

For a circuit with $n$ input bits, there are $2^n$ possible inputs to check!

This algorithm solves the decision problem, but instead we solve a related problem: find every input that satisfies the circuit.

```C
/* for an n-bit circuit */
Let N = 2^n
for ( int i = 0; i < N; i++ )
    if the bits of i satisfy the circuit
        output i
```

---
name: parallel-sat-1

### Parallelizing the Solution: Partitioning and Communication

The obvious partition: assign one $n$-bit input to each primitive task. Each task will check whether the $n$-bit number satisfies the circuit, and if it does, it will print the number as a binary string.

There is NO communication between any pair of tasks, except to send output to whichever task does the printing.

The task graph for $n=16$ would be:

.center[
]

---
name: parallel-sat-2

### Parallelizing the Solution: Agglomeration and Mapping

There is a fixed number of tasks and no communication among them, so we follow the leftmost branches of Foster's decision tree.

Do all tasks spend the same amount of time in their computations?

--

No. In general, they can spend vastly different amounts of time computing.

--

Because tasks do not have an even computational load, we follow the right branch of the last decision node and decide to cyclically map tasks to processors to balance the computational load. We assign one task to each processor and cyclically map numbers to processors, as sketched below.
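A minimal sketch of this cyclic mapping (not the chapter's actual program): process `rank` of `numtasks` processes checks inputs `rank`, `rank + numtasks`, `rank + 2*numtasks`, and so on. The function `check_circuit` here is a hypothetical stand-in for a real circuit test:

```C
#include "mpi.h"
#include <stdio.h>

/* Hypothetical stand-in: returns 1 if input v "satisfies the circuit". */
static int check_circuit( int v ) { return (v & 0xF) == 0xF; }

int main( int argc, char* argv[] )
{
    int numtasks, rank;
    const int N = 1 << 16;   /* 2^16 possible inputs for n = 16 */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Cyclic mapping: this process checks every numtasks-th input, */
    /* starting at its own rank.                                    */
    for ( int i = rank; i < N; i += numtasks )
        if ( check_circuit(i) )
            printf("Process %d found satisfying input %d\n", rank, i);

    MPI_Finalize();
    return 0;
}
```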