Message Passing Interface

Message Passing Interface (MPI) is a specification for software that allows many computers to communicate with one another. It is used in computer clusters and supercomputers.

Overview

According to the definition of Gropp et al. (1996), MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation", and "MPI includes point-to-point message passing and collective (global) operations, all scoped to a user-specified group of processes."[1] MPI is a language-independent communications protocol used to program parallel computers.

MPI's goals are high performance, scalability, and portability. While it is generally considered to have been successful in meeting these goals, it has also been criticized for being too low level and difficult to use.[citation needed] Despite this complaint, it remains a crucial part of parallel programming[citation needed], since no effective alternative has come forth to take its place.

MPI is not sanctioned by any major standards body; nevertheless, it has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run these programs. The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept.

Although MPI belongs in layers 5 and higher of the OSI Reference Model, implementations may cover most layers of the reference model, with sockets and TCP being used in the transport layer.

Most MPI implementations consist of a specific set of routines (an API) callable from Fortran, C, or C++ and from any language capable of interfacing with such routine libraries. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs). MPI is supported on shared-memory and NUMA (Non-Uniform Memory Access) architectures as well, where it serves both as an important portability layer and as a way to achieve high performance in applications that are naturally owner-computes oriented.

MPI is a specification, not an implementation. MPI has Language Independent Specifications (LIS) for the function calls and language bindings. The first MPI standard specified ANSI C and Fortran-77 language bindings together with the LIS. The draft of this standard was presented at Supercomputing 1994 (November 1994) and finalized soon thereafter. About 128 functions comprise the MPI-1.2 standard as it is now defined.

There are two versions of the standard[2] that are currently popular:[3] version 1.2 (often called simply MPI-1), which emphasizes message passing and has a static runtime environment (a fixed-size world), and MPI-2.1 (MPI-2), which includes new features such as parallel I/O, dynamic process management and remote memory operations.[2] MPI-2's LIS specifies over 500 functions and provides language bindings for ANSI C, ANSI Fortran (Fortran 90), and ANSI C++. Interoperability of objects defined in MPI was also added to allow for easier mixed-language message passing programming. A side effect of MPI-2 standardization (completed in 1996) was clarification of the MPI-1 standard, creating the MPI-1.2 level.

It is important to note that MPI-1.2 programs, now deemed "legacy MPI-1 programs," still work under the MPI-2 standard although some functions have been deprecated. This is important since many older programs use only the MPI-1 subset.

MPI is often compared with PVM, which is a popular distributed environment and message passing system developed in 1989, and which was one of the systems that motivated the need for standard parallel message passing systems. Most computer science students who study parallel programming are taught both Pthreads and MPI programming as complementary programming models.[citation needed]

Functionality

The MPI interface is meant to provide essential virtual topology, synchronization, and communication functionality between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few features that are language-specific. MPI programs always work with processes, although people commonly talk about processors. For maximum performance, one process per processor (or, more recently, per core) is selected as part of the mapping activity; this mapping happens at runtime, through the agent that starts the MPI program, normally called mpirun or mpiexec.

Such functions include, but are not limited to, point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gathering and reduction operations), synchronizing nodes (barrier operation) as well as obtaining network-related information such as the number of processes in the computing session, current processor identity that a process is mapped to, neighboring processes accessible in a logical topology, and so on. Point-to-point operations come in synchronous, asynchronous, buffered, and ready forms, to allow both relatively stronger and weaker semantics for the synchronization aspects of a rendezvous-send. Many outstanding operations are possible in asynchronous mode, in most implementations.
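
For illustration only, the following minimal sketch (not part of the standard text) shows the asynchronous (nonblocking) form of point-to-point communication using MPI_Isend, MPI_Irecv, and MPI_Wait; it assumes at least two processes in MPI_COMM_WORLD, and the value sent is arbitrary.

 /* Sketch: nonblocking send/receive between ranks 0 and 1. */
 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char *argv[])
 {
   int rank, value = 0;
   MPI_Request req;
   MPI_Status  status;
 
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   if (rank == 0) {
     value = 42;                       /* arbitrary payload */
     MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
     /* ... unrelated computation could overlap with the transfer here ... */
     MPI_Wait(&req, &status);          /* completes the send */
   } else if (rank == 1) {
     MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
     MPI_Wait(&req, &status);          /* completes the receive */
     printf("rank 1 received %d\n", value);
   }
 
   MPI_Finalize();
   return 0;
 }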

MPI-1 and MPI-2 both enable implementations that overlap communication and computation well, but practice and theory differ. MPI also specifies thread-safe interfaces, which have cohesion and coupling strategies that help avoid the manipulation of unsafe hidden state within the interface. As such, it is relatively easy to write multithreaded point-to-point MPI code, and some implementations support such code. Multithreaded collective communication is best accomplished by using multiple copies of Communicators, as described below.
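
As a sketch of how a program negotiates thread support in MPI-2 (an illustration added here, not prescribed by the article), the MPI_Init_thread call requests a threading level and reports the level the implementation actually grants:

 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char *argv[])
 {
   int provided;
 
   /* Request full multithreaded support; the implementation reports what it grants. */
   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
   if (provided < MPI_THREAD_MULTIPLE)
     printf("This MPI implementation does not provide MPI_THREAD_MULTIPLE\n");
 
   /* ... threads may now make MPI calls according to the granted level ... */
 
   MPI_Finalize();
   return 0;
 }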

Concepts

There are nine basic concepts of MPI, five of which are only applicable to MPI-2.

Communicator

Although MPI has many functions, a few concepts are central; taken a few at a time, they help people learn MPI quickly and decide what functionality to use in their application programs.

Communicators are groups of processes in the MPI session. Each communicator gives its processes a rank order and its own virtual communication fabric for point-to-point operations, as well as an independent communication space for collective communication. MPI also has explicit groups, but these are mainly useful for organizing and reorganizing subsets of processes before another communicator is made. MPI understands single-group intracommunicator operations and bipartite (two-group) intercommunicator communication. In MPI-1, single-group operations are most prevalent; bipartite operations find their biggest role in MPI-2, where their use is expanded to include collective communication and dynamic process management.

Communicators can be partitioned using several MPI commands; these include MPI_COMM_SPLIT, a graph-coloring-type algorithm that is commonly used to derive topological and other logical subgroupings in an efficient way.
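
A minimal sketch (an illustration added here, not taken from the MPI documents) of MPI_COMM_SPLIT through its C binding MPI_Comm_split, splitting MPI_COMM_WORLD into even-ranked and odd-ranked sub-communicators:

 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
   int world_rank, sub_rank;
   MPI_Comm sub_comm;
 
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
 
   /* Processes supplying the same "color" end up in the same new communicator;
      the "key" (here world_rank) determines rank ordering within it. */
   MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
   MPI_Comm_rank(sub_comm, &sub_rank);
 
   MPI_Comm_free(&sub_comm);
   MPI_Finalize();
   return 0;
 }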

Point-to-point basics

A number of important functions in the MPI API involve communication between two specific processes. A much used example is the MPI_Send interface, which allows one specified process to send a message to a second specified process. Point-to-point operations, as these are called, are particularly useful in master-slave program architectures, where a master node might be responsible for managing the data-flow of a collection of slave nodes. Typically, the master node will send specific batches of instructions or data to each slave node, and possibly merge results upon completion.

Collective basics

Collective functions in the MPI API involve communication between all processes in a process group (which can mean the entire process pool or a program-defined subset). A typical function is the MPI_Bcast call (short for "broadcast"). This function takes data from one specially identified node and sends that message to all processes in the process group. A reverse operation is the MPI_Reduce call, which takes data from all processes in a group, performs a user-chosen operation (such as summing), and stores the result on one individual node. This type of call is also useful in master-slave architectures, where the master node may want to sum results from all slaves to arrive at a final result, for instance.
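
The following minimal sketch (an illustrative example, not from the standard text; the values are placeholders) combines the two collective calls just described: rank 0 broadcasts a value to every process, and each process then contributes a partial result that is summed back on rank 0.

 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char *argv[])
 {
   int rank, n = 0, local, total = 0;
 
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   if (rank == 0) n = 100;                       /* data known only to the root */
   MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* now every process has n */
 
   local = n + rank;                             /* each process computes a partial result */
   MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
 
   if (rank == 0) printf("sum of partial results: %d\n", total);
 
   MPI_Finalize();
   return 0;
 }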

One-sided communication (MPI-2)

This section needs to be developed.
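
As an illustrative sketch only (this section of the article remains to be developed), MPI-2's remote memory operations expose a "window" of memory that other processes can access with calls such as MPI_Put and MPI_Get. The example below assumes at least two processes and has rank 0 write a placeholder value into rank 1's window.

 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char *argv[])
 {
   int rank, buf = 0, src = 99;        /* 99 is an arbitrary placeholder value */
   MPI_Win win;
 
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   /* Every process exposes one int; only rank 1's window is targeted below. */
   MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
 
   MPI_Win_fence(0, win);              /* open an access epoch */
   if (rank == 0)
     MPI_Put(&src, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
   MPI_Win_fence(0, win);              /* close the epoch; the put is complete */
 
   if (rank == 1) printf("rank 1's window now holds %d\n", buf);
 
   MPI_Win_free(&win);
   MPI_Finalize();
   return 0;
 }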

Collective extensions (MPI-2)

This section needs to be developed.

Dynamic process management (MPI-2)

The key aspect of this MPI-2 feature is "the ability of an MPI process to participate in the creation of new MPI processes or to establish communication with MPI processes that have been started separately".[4]
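
A minimal sketch of this feature (an illustration added here; "./worker" is a hypothetical, separately built MPI executable): the parent program spawns two worker processes with MPI_Comm_spawn and receives an intercommunicator connecting it to them.

 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
   MPI_Comm children;
   int errcodes[2];
 
   MPI_Init(&argc, &argv);
 
   /* "./worker" is a placeholder name for a separately built MPI program. */
   MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                  0, MPI_COMM_WORLD, &children, errcodes);
 
   /* The parent and the spawned processes can now communicate through
      the intercommunicator 'children'. */
 
   MPI_Finalize();
   return 0;
 }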

MPI I/O (MPI-2)

The parallel I/O feature introduced with MPI-2 is sometimes called MPI-IO for short.[5]
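
As a minimal sketch of the MPI-IO interface (an illustration added here; "ranks.dat" is a placeholder file name), each process writes its rank into a shared file at a rank-dependent offset:

 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
   int rank;
   MPI_File fh;
 
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   /* "ranks.dat" is a placeholder file name. */
   MPI_File_open(MPI_COMM_WORLD, "ranks.dat",
                 MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
   MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(int), &rank, 1, MPI_INT,
                     MPI_STATUS_IGNORE);
   MPI_File_close(&fh);
 
   MPI_Finalize();
   return 0;
 }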

Miscellaneous improvements of MPI-2

This section needs to be developed.

Guidelines for writing multithreaded MPI-1 and MPI-2 programs

This section needs to be developed.

Implementations

'Classical' cluster and supercomputer implementations

The implementation language for MPI is different in general from the language or languages it seeks to support at runtime. Most MPI implementations are done in a combination of C, C++ and assembly language, and target C, C++, and Fortran programmers. However, the implementation language and the end-user language are in principle always decoupled.

The initial implementation of the MPI 1.x standard was MPICH, from Argonne National Laboratory (correctly pronounced M-P-I-C-H, not as a single syllable) and Mississippi State University. IBM was also an early implementor of the MPI standard, and most supercomputer companies of the early 1990s either commercialized MPICH or built their own implementation of the MPI 1.x standard. LAM/MPI from the Ohio Supercomputer Center was another early open implementation. Argonne National Laboratory has continued developing MPICH for over a decade, and now offers MPICH2, an implementation of the MPI-2.1 standard. LAM/MPI and a number of other MPI efforts recently merged to form a new worldwide project, the Open MPI implementation, though this name does not imply any connection with a special form of the standard. Many other efforts are derivatives of MPICH, LAM, and other works, too numerous to name here. Recently, Microsoft added an MPI effort to its Cluster Computing Kit (2005), based on MPICH2. MPI remains a vital interface for concurrent programming to this date.

Many Linux distributions include MPI (MPICH and/or LAM, as particular examples), but it is best to obtain the newest versions from the MPI developers' sites. Many vendors distribute customised versions of existing free software implementations that focus on better performance and stability.

Besides the mainstream use of MPI for high-performance programming, MPI has been used widely with Python, Perl, and Java, and these communities are growing. MATLAB-based MPI implementations appear in many forms, but no consensus on a single way of using MPI with MATLAB yet exists. The next sections detail some of these efforts.

Python

There are at least five known attempts to implement MPI for Python: mpi4py, PyPar, PyMPI, MYMPI, and the MPI submodule in ScientificPython. PyMPI is notable because it is a variant Python interpreter, making the multi-node application the interpreter itself rather than the code the interpreter runs. PyMPI implements most of the MPI specification and automatically works with compiled code that needs to make MPI calls. PyPar, MYMPI, and ScientificPython's submodule are all designed to work like a typical module, used with nothing but an import statement; they leave it to the coder to decide when and where the call to MPI_Init belongs.

OCaml

The OCamlMPI module implements a large subset of MPI functions and is in active use in scientific computing. As an indication of its maturity, it was reported on caml-list that an eleven-thousand-line OCaml program was "MPI-ified" using the module, with an additional 500 lines of code and slight restructuring, and ran with excellent results on up to 170 nodes of a supercomputer.

Java

Although Java does not have an official MPI binding, there have been several attempts to bridge Java and MPI, with different degrees of success and compatibility. One of the first attempts was Bryan Carpenter's mpiJava, essentially a collection of JNI wrappers to a local C MPI library, resulting in a hybrid implementation with limited portability, which also has to be recompiled against the specific MPI library being used.

However, this original project also defined the mpiJava API (a de facto MPI API for Java that closely follows the equivalent C++ bindings), which other subsequent Java MPI projects adopted. An alternative, although less used, API is the MPJ API, designed to be more object-oriented and closer to Sun Microsystems' coding conventions. Beyond the API used, Java MPI libraries can either depend on a local MPI library or implement the message passing functions in Java, while some, like P2P-MPI, also provide peer-to-peer functionality and allow mixed-platform operation.

Some of the most challenging parts of any MPI implementation for Java arise from the language's own limitations and peculiarities, such as the lack of explicit pointers and of a linear memory address space for its objects, which make transferring multi-dimensional arrays and complex objects inefficient. The workarounds usually involve transferring one line at a time and/or performing explicit de-serialization and casting at both the sending and receiving ends, simulating C- or Fortran-like arrays by the use of a one-dimensional array, and simulating pointers to primitive types by the use of single-element arrays, resulting in programming styles quite foreign to Java's conventions.

One major improvement is MPJ Express by Aamir Shafi, a project supervised by Bryan Carpenter and Mark Baker. On commodity platforms such as Fast Ethernet, advances in JVM technology now enable networking programs written in Java to rival their C counterparts. At the same time, improvements in specialized networking hardware have continued, cutting communication costs down to a couple of microseconds. Given both trends, the key issue at present is not to debate the JNI approach versus the pure Java approach, but to provide a flexible mechanism for programs to swap communication protocols. The aim of the project is to provide a reference Java messaging system based on the MPI standard. The implementation follows a layered architecture built on the idea of device drivers, analogous to UNIX device drivers.

Microsoft Windows

Windows Compute Cluster Server uses the Microsoft Message Passing Interface v2 (MS-MPI) to communicate between the processing nodes on the cluster network. The application programming interface consists of over 160 functions. MS-MPI was designed, with some exceptions because of security considerations, to cover the complete set of MPI-2 functionality as implemented in MPICH2. Dynamic process spawning and publishing are planned for the future.

There is also a completely managed .NET implementation of MPI, Pure Mpi.NET. Its object-oriented API is intended to be easy to use for parallel programming. It is built on recent .NET technologies, including Windows Communication Foundation (WCF), which allows the binding and endpoint configuration to be specified declaratively for the environment and performance needs at hand. When using the SDK, a programmer will see the MPI character of the interfaces come through, while the library takes advantage of .NET features including generics, delegates, asynchronous results, exception handling, and extensibility points.

Hardware implementations

There has been research over time into implementing MPI directly into the hardware of the system, for example by means of Processor-in-memory, where the MPI operations are actually built into the microcircuitry of the RAM chips in each node. By implication, this type of implementation would be independent of the language, OS or CPU on the system, but cannot be readily updated or unloaded.

Another approach has been to add hardware acceleration to one or more parts of the operation. This may include hardware processing of the MPI queues or the use of RDMA to directly transfer data between memory and the network interface without needing CPU or kernel intervention.

Example program

Here is "Hello World" in MPI written in C. In this example, we send a "hello" message to each processor, manipulate it trivially, send the results back to the main process, and print the messages out.

 /*
  "Hello World" Type MPI Test Program
 */
 #include <mpi.h>
 #include <stdio.h>
 #include <string.h>
 
 #define BUFSIZE 128
 #define TAG 0
 
 int main(int argc, char *argv[])
 {
   char idstr[32];
   char buff[BUFSIZE];
   int numprocs;
   int myid;
   int i;
   MPI_Status stat; 
  
   MPI_Init(&argc,&argv); /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
   MPI_Comm_size(MPI_COMM_WORLD,&numprocs); /* find out how big the SPMD world is */
   MPI_Comm_rank(MPI_COMM_WORLD,&myid); /* and find out this process's rank */
 
   /* At this point, all the programs are running equivalently, the rank is used to
      distinguish the roles of the programs in the SPMD model, with rank 0 often used
      specially... */
   if(myid == 0)
   {
     printf("%d: We have %d processors\n", myid, numprocs);
     for(i=1;i<numprocs;i++)
     {
       sprintf(buff, "Hello %d! ", i);
       MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
     }
     for(i=1;i<numprocs;i++)
     {
       MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
       printf("%d: %s\n", myid, buff);
     }
   }
   else
   {
     /* receive from rank 0: */
     MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
     sprintf(idstr, "Processor %d ", myid);
     strcat(buff, idstr);
     strcat(buff, "reporting for duty\n");
     /* send to rank 0: */
     MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
   }
 
   MPI_Finalize(); /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
   return 0;
 }

It is important to note that the runtime environment for the MPI implementation used (often called mpirun or mpiexec) spawns multiple copies of this program, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, which is an opaque descriptor for communication among the set of processes. A Single-Program-Multiple-Data (SPMD) programming model is thereby facilitated, but not required; many MPI implementations allow multiple, different executables to be started in the same MPI job. Each process has its own rank and knows the total number of processes in the world, and the processes can communicate either with point-to-point (send/receive) communication or by collective communication among the group. It is enough for MPI to provide an SPMD-style program with MPI_COMM_WORLD, its own rank, and the size of the world for algorithms to decide what to do based on their rank. In more realistic examples, additional I/O to the real world is of course needed. MPI does not guarantee how POSIX I/O, as used in the example, would actually work on a given system, but it commonly does work, at least from rank 0. Even where it works, POSIX I/O such as printf() is not particularly scalable and should be used sparingly.

MPI uses the notion of process rather than processor. The copies of the program are mapped to processors by the MPI runtime environment. In that sense, the parallel machine can map to one physical processor, or to N processors, where N is the total number of processors available, or to something in between. For maximal parallel speedup, more physical processors are used, but the ability to separate the mapping from the design of the program is an essential value for development, as well as for practical situations where resources are limited. Note also that this example adjusts its behavior to the size of the world N, so it is scalable to the size given at runtime. There is no separate compilation for each size of concurrency, although different decisions might be taken internally depending on the absolute amount of concurrency provided to the program.

Adoption of MPI-2

While the adoption of MPI-1.2 has been universal, including in almost all cluster computing, the acceptance of MPI-2.1 has been more limited. Some of the reasons follow.

  1. While MPI-1.2 emphasizes message passing and a minimal, static runtime environment, full MPI-2 implementations include I/O and dynamic process management, and the size of the middleware implementation is substantially larger. Furthermore, most sites that use batch scheduling systems cannot support dynamic process management. Parallel I/O is well accepted as a key value of MPI-2.
  2. Many legacy MPI-1.2 programs were already developed by the time MPI-2 came out, and they work fine. The threat of potentially lost portability from using MPI-2 functions kept people from adopting the enhanced standard for many years, though this is lessening in the mid-2000s, with wider support for MPI-2.
  3. Many MPI-1.2 applications use only a subset of that standard (16-25 functions). This minimalism of use contrasts with the huge availability of functionality now afforded in MPI-2.

Other inhibiting factors can be cited too, although these may amount more to perceptions and belief than fact. MPI-2 has been well supported in free and commercial implementations since at least the early 2000s, with some implementations coming earlier than that.

The future of MPI

Some aspects of MPI's future appear solid; others less so. The MPI Forum was dormant for nearly a decade, but maintained its mailing list. In late 2006 the mailing list was revived for the purpose of clarifying MPI-2 issues, and possibly for defining a new standard level. On February 9, 2007, work on the "MPI-2.1" standard was kicked off, with a new web presence and a new mailing list.[2] It has set its initial scope to revive errata discussions, renew membership and interest, and then explore future opportunities.

  1. MPI as a legacy interface will exist at the MPI-1.2 and MPI-2.1 levels for many years to come. Like Fortran, it is ubiquitous in technical computing, and it is taught and used widely. The body of free and commercial products that require MPI, combined with new ports of the existing free and commercial implementations to new target platforms, help ensure that MPI will go on indefinitely.
  2. Architectures are changing, with greater internal concurrency (multi-core), better fine-grain concurrency control (threading, affinity), and more levels of memory hierarchy. This has already yielded separate, complementary standards for symmetric multiprocessing, namely OpenMP. In the future, however, both massive scale and multi-granular concurrency will expose limitations of the MPI standard, which is only tangentially friendly to multithreaded programming and does not specify enough about how multithreaded programs should be written. While multithread-capable MPI implementations do exist, the number of multithreaded, message-passing applications is small. The drive to achieve multi-level concurrency entirely within MPI is both a challenge and an opportunity for the standard in the future.
  3. The number of functions is huge, though as noted above, the number of concepts is relatively small. However, given that many users don't use the majority of the capabilities of MPI-2, a future standard might be smaller and more focused, or have profiles to allow different users to get what they need without waiting for a complete, validated implementation suite.
  4. Grid and virtual grid computing offer MPI a means of handling static and dynamic process management, with particular "fits". While it is possible to force the MPI model into working on a grid, the idea of a fault-free, long-running MPI program can be difficult to achieve satisfactorily. Grids may want to instantiate MPI APIs between sets of running processes, but multi-level middleware that addresses concurrency, faults, and message traffic is needed. Fault-tolerant MPIs and Grid MPIs have been attempted, but the original design of MPI itself constrains what can be done.
  5. People want a higher-productivity interface; MPI programs are often referred to as the assembly language of parallel programming. This goal, whether pursued through semi-automated compilation, through model-driven architecture and component engineering, or both, means that MPI would have to evolve and, in some sense, move into the background.

These areas, some well funded by DARPA and others, and some underway in academic groups worldwide, have yet to produce a consensus that can fundamentally disrupt MPI's key values: performance, portability, and ubiquitous support.

Notes

  1. Gropp et al. 1996, p. 3
  2. Gropp et al. 1999 (advanced), pp. 4-5
  3. Vanneschi 1999, p. 170; quote: "all the de facto standards (such as MPI-1, MPI-2, HPF)"
  4. Gropp et al. 1999 (advanced), p. 7
  5. Gropp et al. 1999 (advanced), pp. 5-6

References

  • This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
  • Foster, Ian (1995) Designing and Building Parallel Programs (Online). Addison-Wesley. ISBN 0201575949, chapter 8: Message Passing Interface.
  • Using MPI series:
    • Gropp, William; Lusk, Ewing; Skjellum, Anthony (1994) Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press Scientific and Engineering Computation Series, Cambridge, MA, USA. 307 pp. ISBN 0-262-57104-8.
    • Gropp, William; Lusk, Ewing; Skjellum, Anthony (1999) Using MPI, 2nd Edition: Portable Parallel Programming with the Message Passing Interface. MIT Press Scientific and Engineering Computation Series, Cambridge, MA, USA. 395 pp. ISBN 978-0-262-57132-6.
    • Gropp, William; Thakur, Rajeev; Lusk, Ewing (1999) Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, Cambridge, MA, USA.
  • MPI—The Complete Reference series:
    • Snir, Marc; Otto, Steve; Huss-Lederman, Steven; Walker, David; Dongarra, Jack (1995) MPI: The Complete Reference. MIT Press, Cambridge, MA, USA.
    • Snir, Marc; Otto, Steve; Huss-Lederman, Steven; Walker, David; Dongarra, Jack (1998) MPI—The Complete Reference: Volume 1, The MPI Core. MIT Press, Cambridge, MA.
    • Gropp, William; Huss-Lederman, Steven; Lumsdaine, Andrew; Lusk, Ewing; Nitzberg, Bill; Saphir, William; Snir, Marc (1998) MPI—The Complete Reference: Volume 2, The MPI-2 Extensions. MIT Press, Cambridge, MA.
  • Vanneschi, Marco (1999) "Parallel paradigms for scientific computing". In Proc. of the European School on Computational Chemistry (1999, Perugia, Italy), number 75 in Lecture Notes in Chemistry, pages 170–183. Springer, 2000.