class: center, middle, title-slide-cs49365

## CSci 493.65/795.24 Parallel Computing
## Chapter 4: Parallel Programming in Open MPI

.author[
Stewart Weiss
]

.license[
Copyright 2021-24 Stewart Weiss. Unless noted otherwise, all content is released under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
Background image: George Washington Bridge at Dusk, by Stewart Weiss.
]

---
name: cc-notice
template: default
layout: true

.bottom-left[© Stewart Weiss. CC-BY-SA.]

---
name: tinted-slide
template: cc-notice
layout: true
class: tinted

---
name: contents

### Preview

We introduce the following parts of MPI in this chapter.

| MPI Function    | Purpose                                            |
|:---             |:---                                                |
| `MPI_Init`      | initializes MPI                                    |
| `MPI_Comm_rank` | determines a process's ID number (called its rank) |
| `MPI_Comm_size` | determines the number of processes                 |
| `MPI_Reduce`    | performs a reduction operation                     |
| `MPI_Finalize`  | shuts down MPI and releases resources              |
| `MPI_Bcast`     | performs a broadcast operation                     |
| `MPI_Barrier`   | performs a barrier synchronization operation       |
| `MPI_Wtime`     | determines the wall time                           |
| `MPI_Wtick`     | determines the length of a clock tick              |

---
name: mpi-model

### The Message Passing Model

MPI's underlying hardware model:

- It acts as if there are independent processors, each with its own local memory, connected by an interconnection network in which every processor is directly connected to every other processor.

.center[
]

---
name: mpi-paradigm

### The Program Paradigm

- In MPI 1.0, new processes could not be created dynamically; in MPI 2 it became possible.
- MPI uses the .redbold[Single Program Multiple Data] (.redbold[SPMD]) model: all processes execute the same program, but each has a __unique identification number__.
- To make processes do different things, they can branch on decisions based on their identification number:

```C
if ( my_id == n )
    f();
else
    g();
```

---
name: blocking

### Blocking and Non-Blocking Communication

- Some message-passing primitives are .redbold[blocking] and some are .redbold[non-blocking].
- A communication function is .redbold[blocking] if the executing process cannot continue execution until the called communication function has completed. It is .redbold[non-blocking] if the process can proceed even if the function has not completed its operation (such as delivering a message to the destination or receiving a message from a source).
- Blocking primitives are also called .redbold[synchronous] primitives; non-blocking ones are called .redbold[asynchronous] primitives.

---
name: synchronizing

### Message-Passing for Synchronization

- Messages are used not just to exchange data between processes, but to .redbold[synchronize] as well.
- When a process calls a blocking `receive`, it does not return from the call until some other process has sent a message to it.
- A message can be empty.
- Two processes can synchronize their execution by exchanging empty messages, each using a blocking receive. (A sketch appears on the Testing and Debugging slide below.)
- MPI also has special synchronization operations.

.redbold[It is best to limit the use of synchronization as much as possible] because it reduces the amount of parallel activity.

---
name: overview-2

### About Message-Passing

- Accessing data from another process is slower than retrieving it from local memory, because passing a message incurs communication overhead that a local access does not.
- Local data accesses tend to result in high cache hit rates because of spatial locality, which improves performance.
- Because communication costs reduce performance, good programmers design their programs to make as many accesses local as possible.

---
name: debugging

### Testing and Debugging

The biggest downside to writing parallel programs is that testing and debugging them is much, much harder than testing and debugging sequential programs.

- Repeated runs of the same program do not necessarily produce the exact same results or the same sequence of operations in time.
- Trying to force programs to repeat what they did requires modifying code in such a way that bugs might even be hidden.
- Message-passing programs are much easier to test and debug than those using a shared-memory model, because the message-passing interface, when used properly, prevents .redbold[race conditions].
- A .redbold[race condition] exists in a system of two or more processes when the state of the system depends on the order in which these processes execute a sequence of instructions.
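For example, message passing can impose an order on two processes' actions. Below is a minimal sketch (not one of the chapter's programs) in which rank 1 blocks in `MPI_Recv` until rank 0's empty message arrives, so rank 1's step is guaranteed to happen after rank 0's, and there is no race between them:

```C
#include "mpi.h"
#include <stdio.h>

int main( int argc, char* argv[] )
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if ( rank == 0 ) {
        printf("rank 0: first step\n");
        /* empty message: count 0, no buffer needed */
        MPI_Send(NULL, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    }
    else if ( rank == 1 ) {
        /* blocks until rank 0's message arrives */
        MPI_Recv(NULL, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: runs strictly after rank 0's first step\n");
    }
    MPI_Finalize();
    return 0;
}
```

The non-blocking counterparts, `MPI_Isend` and `MPI_Irecv`, return immediately and are paired with a later `MPI_Wait`; they provide no such ordering until the wait completes.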
---
name: tips

### Testing and Debugging Advice

- Use lots of print statements when debugging to see what's what
- Always run with just a few processes when debugging
- Learn to use the GDB debugger
- Run on a single host to start, then on two

---
name: terminology

### MPI Terminology

Some terms to learn:

- .redbold[Communicator]: An object that makes it possible for processes to communicate with each other using MPI message-passing primitives
- .redbold[Handle]: The unique ID of a communicator
- .redbold[Rank]: The unique non-negative ID of a process within the communicator, 0-based
- .redbold[Buffer]: An array or string that is transferred to one or more processes using a message-passing operation
- .redbold[Core]: An individual computing element
- .redbold[Node]: A collection of computing elements that
  - share the same network address,
  - share memory, and
  - are typically on the same motherboard
- .redbold[Hostfile]: The list of hosts on which an MPI executable is scheduled to run

---
name: first-prog

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */
int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    /* retrieve running task's rank (id) into rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Number of tasks = %d My rank = %d\n", numtasks, rank);

    /* Before quitting, notify MPI, so it can do cleanup */
    MPI_Finalize();
    return 0;
}
```
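To try it, compile with Open MPI's C compiler wrapper and launch with `mpirun` (the file name `hello.c` is just a choice for this example). Each process prints one line; the order of the lines varies from run to run:

```bash
mpicc -o hello hello.c   # mpicc is Open MPI's compiler wrapper for C
mpirun -np 4 ./hello     # run the program as 4 processes
```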
---
name: first-prog-comments

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    /* This expects the addresses of the argc and argv parameters, so you */
    /* need the address-of operator here. It initializes MPI and makes    */
    /* argc and argv available to it.                                     */
```
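For reference, the prototype; the return value is `MPI_SUCCESS` when initialization succeeds:

```C
int MPI_Init( int *argc, char ***argv );
```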
---
name: first-prog-comments-2

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        `MPI_Abort(MPI_COMM_WORLD, rc)`;
        /* all tasks in MPI_COMM_WORLD are terminated, if possible */
    }
```
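For reference, the prototype; the call makes a best attempt to terminate all tasks in the communicator:

```C
int MPI_Abort( MPI_Comm comm, int errorcode );
```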
---
name: first-prog-3

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    `MPI_Comm_size(MPI_COMM_WORLD, &numtasks);`
    /* You almost always need to know inside a program how many tasks are */
    /* in the communicator, i.e., MPI_COMM_WORLD. This is how you get     */
    /* that number and store it into the second parameter.                */
```
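For reference, the prototypes of the two query functions used in this program:

```C
int MPI_Comm_size( MPI_Comm comm, int *size );  /* number of processes     */
int MPI_Comm_rank( MPI_Comm comm, int *rank );  /* calling process's rank  */
```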
---
name: first-prog-4

### A Simple MPI Program

```C
#include "mpi.h"   /* always needed in MPI programs */
#include <stdio.h> /* for printf */

int main( int argc, char* argv[] )
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);   /* Initialize MPI and save return code */
    if ( rc != MPI_SUCCESS ) {     /* maybe it failed, so check           */
        /* it failed to initialize, so abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    /* retrieve number of tasks to run into numtasks */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    /* retrieve running task's rank (id) into rank */
    `MPI_Comm_rank(MPI_COMM_WORLD, &rank);`
    /* This is how a process gets its ID, called its rank.                 */
    /* Almost every MPI program will need every process to know its rank,  */
    /* which it can use to do task-specific computation.                   */
}
```

---
name: sat

### Solving A Problem: Circuit Satisfiability

We create our first MPI program to solve a relatively simple problem.

A .redbold[decision problem] is a problem that has a yes/no answer. The .redbold[circuit satisfiability problem] is the decision problem of determining whether a given Boolean circuit has an assignment of its inputs that makes the output true.

- The circuit may contain any number of gates.
- If there exists an assignment of values to the inputs that makes the output of the circuit true, it is called a .redbold[satisfiable circuit]; otherwise it is called .redbold[unsatisfiable].
.left-column[
.center[
A satisfiable circuit with 4 inputs]
]
.right-column[
.center[
An unsatisfiable circuit with 2 inputs]
]
.below-column[
__The Circuit Satisfiability Problem: Given a circuit, is it satisfiable?__
]

---
name: sat-2

### The Problem

An obvious sequential solution is to try all possible inputs and stop as soon as one is found that outputs true:

```C
/* for an n-bit circuit */
Let N = 2^n
for ( int i = 0; i < N; i++ )
    if the bits of i satisfy the circuit
        output "satisfiable" and stop
```

For a circuit with $n$ input bits, there are $2^n$ possible inputs to check!

This algorithm solves the decision problem, but instead we solve a related problem: find every input that satisfies the circuit.

```C
/* for an n-bit circuit */
Let N = 2^n
for ( int i = 0; i < N; i++ )
    if the bits of i satisfy the circuit
        output i
```

---
name: parallel-sat-1

### Parallelizing the Solution: Partitioning and Communication

The obvious partition: assign one $n$-bit input to each primitive task. Each task will check whether the $n$-bit number satisfies the circuit, and if it does, it will print the number as a binary string.

There is NO communication between any pair of tasks, except to send output to whichever task does the printing.

The task graph for $n=16$ would be:

.center[
]

---
name: parallel-sat-2

### Parallelizing the Solution: Agglomeration and Mapping

There is a fixed number of tasks and no communication among them, so we follow the leftmost branches of Foster's decision tree.

Do all tasks spend the same amount of time in their computations?

--

No. In general, they can spend vastly different amounts of time computing.

--

Because tasks do not have an even computational load, we follow the right branch of the last decision node and decide to cyclically map tasks to processors to balance the computational load. We assign one task to each processor and cyclically map numbers to processors, as sketched below.
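A minimal sketch of this cyclic mapping (not the chapter's actual program): process `rank` of `numtasks` processes checks inputs `rank`, `rank + numtasks`, `rank + 2*numtasks`, and so on. The function `check_circuit` here is a hypothetical stand-in for a real circuit test:

```C
#include "mpi.h"
#include <stdio.h>

/* Hypothetical stand-in: returns 1 if input v "satisfies the circuit". */
static int check_circuit( int v ) { return (v & 0xF) == 0xF; }

int main( int argc, char* argv[] )
{
    int numtasks, rank;
    const int N = 1 << 16;   /* 2^16 possible inputs for n = 16 */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Cyclic mapping: this process checks every numtasks-th input, */
    /* starting at its own rank.                                    */
    for ( int i = rank; i < N; i += numtasks )
        if ( check_circuit(i) )
            printf("Process %d found satisfying input %d\n", rank, i);

    MPI_Finalize();
    return 0;
}
```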