HOME PAGE FOR VMI21

Virtual Machine Interface 2.1

MPICH-VMI2 Teragrid User Manual
Avneesh Pant (apant@ncsa.uiuc.edu)

Table of Contents

  1. MPICH-VMI2 Installation Locations

    1. NCSA

    2. SDSC

  2. Compiling with MPICH-VMI2

  3. Running with MPICH-VMI2 at a single site

    1. NCSA

    2. SDSC

  4. Running on the Teragrid with MPICH-VMI2

  5. Submitting questions and bug reports

MPICH-VMI2 Installation Locations

NCSA:

MPICH-VMI2 is available for general use on the mercury cluster. It has been compiled with both the GNU gcc compiler and the Intel compiler. Users are advised to use the Intel compiler, since it offers significant improvements over the GNU compiler on the Itanium platform.

Softenv keys are defined for using MPICH-VMI2. You can either use softenv to modify your environment or manually include the binary installation paths in your environment as shown below:

Using softenv, execute the following:

MPICH-VMI2 is installed under the /opt directory on the tg-login1 and tg-login2 head nodes. To set up your environment manually, add the installation's binary directory to the head of your PATH. The following example is for bash shells.
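
The bash example that belongs at this point was not preserved. The lines below are a minimal sketch of the manual approach, not the original example. They assume a hypothetical install prefix held in MPICH_VMI2_HOME and a bin subdirectory beneath it; the actual directory name under /opt is not given in this manual and must be substituted.

```shell
# ASSUMPTION: hypothetical install prefix. Replace it with the real
# MPICH-VMI2 directory under /opt on tg-login1 / tg-login2.
MPICH_VMI2_HOME=/opt/mpich-vmi2-placeholder

# Put the MPICH-VMI2 wrapper scripts ahead of any other MPI installation.
# ASSUMPTION: the wrappers live in a bin subdirectory of the prefix.
export PATH="$MPICH_VMI2_HOME/bin:$PATH"

# Report which mpicc will now be picked up, or warn if none is found.
command -v mpicc || echo "mpicc not found: check MPICH_VMI2_HOME"
```

Prepending rather than appending matters because several MPI installations coexist on the Teragrid, and the first mpicc found on the PATH is the one that gets used.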

SDSC:

MPICH-VMI2 has not yet been installed by the SDSC system administrators. In the meantime, a binary installation of MPICH-VMI2 compiled with the Intel compiler is available under /users/ux453899/install/mpich-vmi-2.0b-3-intel/bin. On SDSC systems this path must be included in your environment before compiling or running. We plan to have MPICH-VMI2 deployed and available in a public space on the SDSC cluster shortly, as the CTSS version 2 rollout on the Teragrid proceeds.


Compiling with MPICH-VMI2

MPICH-VMI2 provides the standard MPI wrapper scripts (mpicc, mpiCC, mpif77 and mpif90) for compiling. Only the Intel compiler can be used to build codes that require Fortran 90 support. The wrapper scripts take standard compiler arguments. Please make sure that you have added the path to the MPICH-VMI2 compiler scripts to your environment, since multiple instances of MPI are available on the Teragrid. Executing `which mpicc` should show the path to the MPI compiler script that will be used.

Example: To compile a simple MPI C program, the following is sufficient:

mpicc -O hello_mpi.c -o hello_mpi

Running with MPICH-VMI2 at a single site

NCSA

The mpirun wrapper script is available for job execution. Standard mpirun arguments are sufficient for single-site job executions. One of the salient features of MPICH-VMI2 is its ability to abstract the underlying communication network from the MPI application without loss of performance. The application does NOT need to be recompiled in order to execute over different underlying networks.

The default network used on the Teragrid is Myrinet, which uses the GM library for inter-node communication. In addition, we support execution on traditional Ethernet networks using the TCP protocol. The user can select the underlying network at runtime by specifying a switch to the mpirun command. MPICH-VMI2 has additional implementation-specific switches that can be passed to mpirun to aid in performance optimization and job profiling. A complete list of these switches and their defaults is available by running the mpirun command with the help option (mpirun --help).

For example, to run the hello_mpi code on 32 processors at NCSA using the Myrinet network, the following mpirun command should suffice:

mpirun -np 32 -specfile myrinet hello_mpi

To run the same code over the Ethernet network instead, WITHOUT recompiling it:

mpirun -np 32 -specfile tcp hello_mpi

SDSC

Unfortunately, since MPICH-VMI2 is not yet installed in a public location at SDSC, running jobs with MPICH-VMI2 there is a two-step process. MPICH-VMI2 requires that a daemon called vmieyes be running on each node that runs an MPI process. These daemons run by default on all the compute nodes at NCSA, but they must be launched manually at SDSC. This restriction will be removed as we move forward on deploying MPICH-VMI2 across the Teragrid. The standard mpirun command can be used to launch the vmieyes daemon before running the MPI code. For example, the following PBS job submission script should work at SDSC.

#!/bin/bash
#PBS -l walltime=10:00,nodes=8:ppn=2
#PBS -N MPICH-VMI2

# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')

uniq $PBS_NODEFILE > machinefile.uniq

# Add MPICH-VMI2 into path
export PATH=/users/ux453899/install/mpich-vmi-2.0b-3-intel/bin:$PATH

# First launch the vmieyes daemons on the machines
mpirun -np $NP -machinefile machinefile.uniq /users/ux453899/install/vmi-2.0b-3-gcc/sbin/vmieyes --reaper=localhost

# Finally launch the application using myrinet for communication
mpirun -np $NP -specfile myrinet hello_mpi
# Alternatively, to use the Ethernet network over TCP, uncomment the following line
# mpirun -np $NP -specfile tcp hello_mpi

# Remove the unique machinefiles
rm machinefile.uniq
# Finally cleanup the temporary vmieyes database files
rm vmieyes-*.db
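
The process and node counts that the script derives from $PBS_NODEFILE can be tried outside a batch job. The lines below are a small sketch, assuming the usual PBS convention that the node file holds one line per processor slot (so nodes=8:ppn=2 would give 16 lines naming 8 distinct hosts). The temporary file and the host names node01 and node02 are hypothetical stand-ins for a real node file.

```shell
# Hypothetical stand-in for $PBS_NODEFILE: 2 nodes with 2 slots each.
nodefile=$(mktemp)
printf '%s\n' node01 node01 node02 node02 > "$nodefile"

# Total MPI processes: one per line of the node file.
# The arithmetic expansion strips any padding that wc may add.
NP=$(( $(wc -l < "$nodefile") ))

# Distinct nodes: each one needs a running vmieyes daemon.
NNODES=$(( $(sort -u "$nodefile" | wc -l) ))

echo "processes: $NP, nodes: $NNODES"
rm -f "$nodefile"
```

Unlike uniq alone, sort -u does not depend on identical host names being adjacent in the file.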


Running on the Teragrid with MPICH-VMI2

A grid job in MPICH-VMI2 consists of multiple subjobs, each of which is a collection of processes. In the context of the Teragrid, a cross-site run with MPICH-VMI2 consists of at least two subjobs, each representing the collection of processes running at one site. For example, a job spanning NCSA, SDSC and Argonne comprises at least three subjobs, one for the processes at each site.

MPICH-VMI2 can generate the topology of the MPI job at runtime without requiring any external user input. The job topology is used within the MPICH-VMI2 stack at runtime to optimize communication patterns (such as MPI collectives) for execution in a grid environment. This is transparent to the user and does not require special compilation of user codes. Thus the same executable compiled to run within a site over either the Myrinet or the Ethernet network can also be used to run across the Teragrid efficiently. This allows MPICH-VMI2 to scale out from traditional Beowulf clusters in a machine room to a grid environment without the user needing to recompile applications for each environment against a different flavour of MPI.

To submit an MPICH-VMI2 job on the Teragrid, the user invokes an individual instance of the application at each site. Using VMI-specific switches to the mpirun launcher, the user specifies the number of processes to launch at that site (i.e. the number of processes for that subjob) and the total number of processes that will take part in the computation across all subjobs.

MPICH-VMI2 uses an external server called the Grid CRM to let these subjobs synchronize with each other on job startup and generate the runtime topology. The Grid CRM is not required for a single-site startup. We have a Grid CRM running at NCSA, on tg-master.ncsa.teragrid.org, for jobs launched across the Teragrid. In order to scale to a large number of nodes and sites on the Teragrid, individual sites and even users can run their own Grid CRM for job synchronization. The user specifies the location of the Grid CRM to use via a switch to the mpirun launcher. The Grid CRM can be run on any machine reachable from the Teragrid compute nodes and does not demand a high-throughput connection. In addition, since it is only used during job startup, it can be shut down once the job has started running. A Grid CRM, such as the default one running at NCSA, can synchronize multiple jobs concurrently. Each MPICH-VMI2 job defines a key, an alphanumeric string, that uniquely identifies the grid job. Each subjob that is part of a larger grid job should specify the same key. The user specifies the key to be used for job synchronization via an mpirun switch.

To provide for efficient execution of codes on the Teragrid, all communication within a site (not necessarily within a subjob, since one may launch multiple subjobs, i.e. multiple mpirun commands, at the same site) uses the Myrinet network. Inter-site communication, however, uses the TCP protocol to exchange messages. This requires that both the Myrinet and the TCP communication modules be active at the same time for a grid job. This is done by passing the xsite-myrinet-tcp network to mpirun via the specfile argument.


For example, to run the hello_mpi program compiled previously across the Teragrid, using 32 processors at NCSA and 16 processors at SDSC, the following mpirun invocations at each site should suffice:
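
The original invocations were not preserved at this point. The sketch below only prints candidate command lines for review and rests on several assumptions: that the grid switches named in this manual (grid-procs, grid-crm, key) take the same single-dash form as -specfile, that -np carries the per-site process count, and that grid-procs carries the total. The key value hellompi001 and the helper grid_cmd are hypothetical. Check the exact spellings with mpirun --help before running anything.

```shell
# ASSUMPTION: switch spellings below mirror -specfile; this manual names the
# switches (grid-procs, grid-crm, key) but not their exact syntax.
GRID_PROCS=48                          # total processes: 32 at NCSA + 16 at SDSC
GRID_CRM=tg-master.ncsa.teragrid.org   # Grid CRM host named in this manual
KEY=hellompi001                        # hypothetical key; any unique alphanumeric string

# Hypothetical helper: print (do not run) the mpirun command for one subjob.
# $1 is the number of processes to launch at that site.
grid_cmd() {
    echo "mpirun -np $1 -specfile xsite-myrinet-tcp -grid-procs $GRID_PROCS -grid-crm $GRID_CRM -key $KEY hello_mpi"
}

NCSA_CMD=$(grid_cmd 32)   # command line intended for NCSA
SDSC_CMD=$(grid_cmd 16)   # command line intended for SDSC
echo "$NCSA_CMD"
echo "$SDSC_CMD"
```

Building both command lines from the same variables keeps the grid-related values identical across sites, so only the local process count differs.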
Please note that all grid-related switches (specfile, grid-procs, grid-crm and key) must be the same at each site! In the near future we plan to support launching MPICH-VMI2 jobs using the Globus jobmanager and RSL scripts.

Questions

Please send all questions, bug reports and comments to Avneesh Pant (apant@ncsa.uiuc.edu).