MPICH-VMI2 Teragrid User Manual
Avneesh Pant (apant@ncsa.uiuc.edu)
MPICH-VMI2 is available for general use on the Mercury cluster. It has been compiled with both the GNU gcc compiler and the Intel compiler. Users are advised to use the Intel build, since the Intel compiler offers significant improvements over the GNU compiler on the Itanium platform.
Softenv keys are defined for using MPICH-VMI2. You can either use softenv to modify your environment or manually include the binary installation paths in your environment as shown below:
Using softenv execute the following:
MPICH-VMI2 is installed under the /opt directory on the tg-login1 and tg-login2 head nodes. To set up your environment manually, add the bin directory of the build you want to the head of your PATH. The following example is for bash shells.
GCC: export PATH=/opt/mpich-vmi-2.0b-3-gcc/bin:$PATH
Intel: export PATH=/opt/mpich-vmi-2.0b-3-intel/bin:$PATH
MPICH-VMI2 has not yet been installed by the SDSC system administrators. In the meantime, a binary installation of MPICH-VMI2 compiled with the Intel compiler is available under /users/ux453899/install/mpich-vmi-2.0b-3-intel/bin. On SDSC systems this path must be included in your environment before compiling or running. We plan to have MPICH-VMI2 deployed and available in a public space on the SDSC cluster shortly, as we move ahead with the CTSS version 2 rollout on the Teragrid.
MPICH-VMI2 provides the standard MPI wrapper scripts (mpicc, mpiCC, mpif77 and mpif90) for compiling. Only the Intel compiler can be used to build codes that require Fortran 90 support. The MPI wrapper scripts take standard compiler arguments. Because multiple MPI installations are available on the Teragrid, make sure that the path to the MPICH-VMI2 compiler scripts has been added to your environment. Executing `which mpicc` should show the path to the MPI compiler script that will be used.
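The check described above can be scripted. The following is a minimal sketch for bash, assuming the Intel build under /opt on the NCSA head nodes; the variable names MPICH_VMI_BIN and found are illustrative only. Substitute the gcc directory or the SDSC path as appropriate.

```shell
# Prepend the Intel build of MPICH-VMI2 to PATH.
# (Assumption: NCSA head node layout; use the -gcc directory for the GNU build.)
MPICH_VMI_BIN=/opt/mpich-vmi-2.0b-3-intel/bin
export PATH="$MPICH_VMI_BIN:$PATH"

# Find out which mpicc the shell will actually pick up.
found=$(command -v mpicc || true)

if [ "$found" = "$MPICH_VMI_BIN/mpicc" ]; then
    echo "using MPICH-VMI2 wrapper: $found"
else
    echo "warning: mpicc resolves to '${found:-nothing}', expected $MPICH_VMI_BIN/mpicc"
fi
```

If the warning is printed, another MPI installation precedes MPICH-VMI2 in PATH, or the directory does not exist on the current machine.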
Example: To compile a simple MPI C program the following is
sufficient
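As an illustration only, a typical compile step might look like the sketch below. The file name hello.c, the output name hello and the -O2 flag are assumptions made for this example; the sketch relies only on the fact that the wrapper scripts accept standard compiler arguments.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1

# hello.c: a minimal MPI C program (hypothetical file name).
cat > hello.c <<'EOF'
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# The wrapper takes ordinary compiler arguments such as -O2 and -o.
if command -v mpicc >/dev/null 2>&1; then
    mpicc -O2 -o hello hello.c
else
    echo "mpicc not found in PATH; add the MPICH-VMI2 bin directory first"
fi
```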
The mpirun wrapper script is available for job execution. Standard mpirun arguments are sufficient for single-site job execution. One of the salient features of MPICH-VMI2 is its ability to abstract the underlying communication network from the MPI application without loss of performance. The application does NOT need to be recompiled in order to execute over different underlying networks. The default network on the Teragrid is Myrinet, which uses the GM library for inter-node communication. In addition, execution is supported on traditional Ethernet networks using the TCP protocol. The user can select the underlying network at runtime by specifying a switch to the mpirun command. MPICH-VMI2 has additional implementation-specific switches that can be passed to mpirun to aid in performance optimization and job profiling. A complete list of these switches and their defaults is available by running the mpirun command with the help option (mpirun --help).
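A single-site launch can be sketched as follows. This assumes an executable named ./hello (hypothetical) and uses only the standard -np argument. The network-selection switch is deliberately left out, since its exact name should be taken from the output of mpirun --help.

```shell
# Single-site launch of 4 processes (process count and executable are assumptions).
NPROCS=4
EXE=./hello
RUN_CMD="mpirun -np $NPROCS $EXE"

# Run only if both mpirun and the executable are present; otherwise show the command.
if command -v mpirun >/dev/null 2>&1 && [ -x "$EXE" ]; then
    $RUN_CMD
else
    echo "would run: $RUN_CMD"
fi
```

Because the network is chosen at runtime, the same ./hello binary can be reused unchanged when a different network switch is added to the command line.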
A grid job in MPICH-VMI2 consists of multiple subjobs. Each subjob is a collection of processes. In the context of the Teragrid, a cross-site run with MPICH-VMI2 consists of at least two subjobs, each representing the collection of processes running at one site. For example, a job spanning NCSA, SDSC and Argonne will consist of at least three subjobs, one for the processes at each site. MPICH-VMI2 can generate the topology of the MPI job at runtime without requiring any external user input. The job topology is used within the MPICH-VMI2 stack at runtime to optimize communication patterns (such as MPI collectives) for execution in a grid environment. This is seamless and transparent to the user and does not require special compilation of the user codes. Thus the same executable compiled to run within a site over either a Myrinet or an Ethernet network can be used to run across the Teragrid efficiently. This gives MPICH-VMI2 an unprecedented capability to scale out from traditional Beowulf clusters in a machine room to a grid environment without the user needing to recompile applications for each environment using a different flavour of MPI.
To submit an MPICH-VMI2 job on the Teragrid, the user invokes an individual instance of the application at each site. Using VMI-specific switches to the mpirun launcher, the user specifies the number of processes to launch at that site (that is, the number of processes for that subjob) and the total number of processes that will take part in the computation across all subjobs. MPICH-VMI2 uses an external server called the Grid CRM to let these subjobs synchronize with each other at job startup and to generate the runtime topology. The Grid CRM is not required for a single-site startup. A Grid CRM is running at NCSA, on tg-master.ncsa.teragrid.org, for jobs launched across the Teragrid. To scale to a large number of nodes and sites on the Teragrid, individual sites and even individual users can run their own Grid CRM for job synchronization. The user specifies the location of the Grid CRM to use via a switch to the mpirun launcher. The Grid CRM can run on any machine reachable from the Teragrid compute nodes and does not demand a high-throughput connection. In addition, since it is used only during job startup, it can be shut down once the job has started running. A Grid CRM, such as the default one running at NCSA, can synchronize multiple jobs concurrently. Each MPICH-VMI2 job defines a key, an alphanumeric string that uniquely identifies the grid job. Every subjob that is part of a larger grid job should specify the same key. The user specifies the key to be used for job synchronization via an mpirun switch.
To provide for efficient execution of codes on the Teragrid, all communication within a site (not necessarily within a subjob, since one may launch multiple subjobs, i.e. multiple mpirun commands, at the same site) uses the Myrinet network. Intersite communication, however, uses the TCP protocol to exchange messages. This requires both the Myrinet and the TCP communication modules to be active at the same time for a grid job. This is specified by passing the specfile argument to mpirun and selecting the xsite-myrinet-tcp network.
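The pieces of a cross-site launch (per-site process count, total process count, Grid CRM location, shared key and specfile) can be put together as in the dry-run sketch below. It only prints the command line that would be issued at each site. The switch spellings stored in the FLAG_ variables are assumptions and should be checked against mpirun --help. The key myjob42, the executable ./hello and the subjob sizes are hypothetical.

```shell
# ASSUMED switch spellings; confirm each one with `mpirun --help` before use.
FLAG_SITE_PROCS="-np"           # processes in this subjob
FLAG_TOTAL_PROCS="-grid-procs"  # total processes across all subjobs
FLAG_CRM="-grid-crm"            # host running the Grid CRM
FLAG_KEY="-key"                 # key shared by all subjobs of one grid job
FLAG_SPEC="-specfile"           # network specification

GRID_CRM="tg-master.ncsa.teragrid.org"  # default Grid CRM at NCSA
SPEC="xsite-myrinet-tcp"                # Myrinet within a site, TCP between sites
JOB_KEY="myjob42"                       # hypothetical alphanumeric key
EXE="./hello"                           # hypothetical executable
SUBJOB_SIZES="4 4 8"                    # hypothetical: one subjob per site

# The total process count is the sum over all subjobs.
total=0
for n in $SUBJOB_SIZES; do
    total=$((total + n))
done

# Print the mpirun command for one subjob of the given size.
subjob_cmd() {
    echo "mpirun $FLAG_SITE_PROCS $1 $FLAG_TOTAL_PROCS $total $FLAG_CRM $GRID_CRM $FLAG_KEY $JOB_KEY $FLAG_SPEC $SPEC $EXE"
}

# One command per site; each would be run on that site's head node.
for n in $SUBJOB_SIZES; do
    subjob_cmd "$n"
done
```

Every printed command carries the same key and the same total, which is what allows the Grid CRM to match the subjobs to one another at startup.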