|
Note:
The instructions on this page may be out of date. I haven't used MPI
on these machines in some time. [MSW 2/7/07]
The
networked Linux workstations in the School of Physics (this link is
accessible only within the SoP) can be used by members of the School as a
distributed memory parallel computer (a cluster) by following the steps
below.
Please note that you should be mindful of the needs of others when you use
computers in this way. You should only use CPUs or computers which are
not being used already, you should consider your memory and network
requirements, and you should "nice" your processes (as explained
in Step 8). If you have a large-scale job, use
Myrmidon.
-
An implementation of the Message Passing Interface (MPI), namely
LAM/MPI, is available on all
networked Linux workstations. First put the MPI binaries in your path,
and set the "LAMHOME" environment variable, e.g. by putting the lines
set path=(/usr/physics/mpi/bin $path)
setenv LAMHOME /usr/physics/mpi/
in your "~/.cshrc" file (assuming a C shell). You also need to put the
Intel Fortran 90/95 compiler ifc in your path; see
/usr/physics/intel/readme. In short, run the following:
source /usr/physics/intel/intel.csh
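If your login shell is a Bourne-type shell (sh or bash) rather than the C shell, the two path settings above might look like the sketch below. The startup file they would go in (for example "~/.profile") is an assumption, and no Bourne-shell counterpart of /usr/physics/intel/intel.csh is assumed to exist; check /usr/physics/intel/readme for the compiler setup.

```shell
# Bourne-shell (sh/bash) sketch of the two C-shell lines above.
# Assumption: these lines go in a startup file such as ~/.profile.

# Put the LAM/MPI binaries at the front of the search path.
PATH=/usr/physics/mpi/bin:$PATH
export PATH

# Tell LAM/MPI where it is installed.
LAMHOME=/usr/physics/mpi/
export LAMHOME
```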
-
Write a Fortran MPI code and compile it using the compiler "hf77",
which is a wrapper to the Intel Fortran compiler (ifc).
The advantage of using the wrapper is that it specifies the locations
of the MPI library and include files, so that you can compile in a
simple way, e.g. using
% hf77 -O3 -o code code.f
where "code.f" is your Fortran 90 MPI code. Note that despite the name of the
wrapper, the underlying compiler ifc is a Fortran 90/95 compiler. If your
code is in Fortran 90 free source form, add the "-FR" flag, e.g.
% hf77 -FR -O3 -o code code.f
and you can pass any of the other ifc compiler flags.
You can also use the current setup of LAM/MPI with C codes using the
wrapper "hcc", e.g. using
% hcc -O3 -o code code.c
assuming "code.c" is your MPI C code. The underlying compiler in this
case is gcc.
If you need an introduction to MPI, the Users' Guides in C and Fortran
available at
this location are a reasonable starting point. If you are a student, consider enrolling in
the second semester unit
COSC 3012/3912, Parallel Computing and Visualisation. (If you are a PhD
student in the SoP, note that this course could be one of the two courses
you are required to complete.)
-
Put the compiled code somewhere that is visible from all of the Linux
workstations you intend to use, for example a directory in your home
directory.
-
Set up password-less ssh to the networked Linux workstations.
This is done by first generating an authentication key on the host
you intend to run the code from:
% ssh-keygen -t rsa
During this procedure you will be prompted for a passphrase; just hit
return at that step (which corresponds to having no passphrase). After
you have done this, append the contents of the file
"~/.ssh/id_rsa.pub" created by the previous command to the file
"~/.ssh/authorized_keys" on one of the networked Linux machines you wish
to use. For example, execute:
% cat ~/.ssh/id_rsa.pub | ssh user@machine2 'cat >> .ssh/authorized_keys'
where "user" is your username, and "machine2" is one of the other
workstations. The ">>" redirection creates the file "~/.ssh/authorized_keys"
if it doesn't exist, but the directory "~/.ssh" must already exist on that
machine.
You should then try connecting via ssh to each of the
workstations you wish to use. The first time you ssh, you will be asked if
you want to continue connecting. Type "yes" at this prompt. Subsequent ssh
connections should occur without this step (and without requiring a
password).
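Once you have connected to each workstation by hand at least once, a short script can confirm that key-based login still works everywhere. The following is a minimal sketch in Bourne shell (not the C shell assumed elsewhere on this page); the function name "check_ssh" is made up for illustration, and the five second timeout is an arbitrary choice. It relies on the standard OpenSSH option "BatchMode=yes", which makes ssh fail rather than prompt for a password.

```shell
#!/bin/sh
# check_ssh HOST...
# Hypothetical helper: report, for each host given, whether a
# non-interactive (key-based) ssh login succeeds.
check_ssh() {
    for host in "$@"; do
        # BatchMode=yes: never prompt, so a missing key shows up as a failure.
        # ConnectTimeout=5: give up on unreachable hosts after five seconds.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example use (host names are placeholders):
#   check_ssh machine2 machine3
```

A host you have never connected to is likely to be reported as failing as well, since ssh cannot ask the "continue connecting" question in batch mode.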
-
Write a "hostfile" containing the list of machines you intend to use. An
example file might contain the following lines.
machine1.physics.usyd.edu.au cpu=2 user=wheat
machine2.physics.usyd.edu.au cpu=2 user=wheat
machine3.physics.usyd.edu.au cpu=2 user=wheat
The computer machine1.physics.usyd.edu.au
should be the host you intend to launch the code from (the computer
you are logged into). The file specifies that the code should also
be run on the nodes "machine2" and "machine3". The "cpu=2" entries
declare that each computer has two processors, and "user=wheat" gives
the username to use on each machine.
It is possible that you may need to replace the machine names
by their IP addresses. For example,
machine1.physics.usyd.edu.au may need
to be replaced by its IP address, say
129.78.129.120. Note that IP addresses can
be obtained at the command line using nslookup.
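If you use many machines, typing the hostfile by hand gets tedious. The sketch below, in Bourne shell rather than the C shell, writes lines in the same format as the example above. The function name "make_hostfile" is made up for illustration, and it assumes every machine has the same number of processors and the same username.

```shell
#!/bin/sh
# make_hostfile USER CPUS HOST...
# Hypothetical helper: print one hostfile line per host, in the
# "host cpu=N user=NAME" format used in the example above.
make_hostfile() {
    user=$1
    cpus=$2
    shift 2
    for host in "$@"; do
        printf '%s cpu=%s user=%s\n' "$host" "$cpus" "$user"
    done
}

# Example use, reproducing the three-line file shown above
# (list the machine you are logged into as one of the hosts):
#   make_hostfile wheat 2 machine1.physics.usyd.edu.au \
#       machine2.physics.usyd.edu.au machine3.physics.usyd.edu.au > hostfile
```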
-
Tell LAM/MPI to use ssh as the method of executing commands remotely,
and test the possibility of booting LAM/MPI:
% setenv LAMRSH "ssh -x"
% recon -d hostfile
(this assumes the C shell). [I think the ssh option is now set
automatically, so the first command may not be necessary.] If the recon command fails
with a message about not being able to find the hosts in the hostfile,
try replacing the hostnames by their corresponding IP addresses, as
explained above.
-
Boot LAM/MPI:
% lamboot -v hostfile
[This may require a few tries if password-less ssh has just been set up
(Step 4).]
You can check that you have the expected set of nodes by typing
% lamnodes
-
Run your code with the desired number of processes (which may be greater
than, equal to, or less than the total number of processors on all machines).
For example, for six processes:
% mpirun -np 6 nice -n 10 code
where the name of the executable is "code". Note that this example follows
the recommendation that distributed jobs be niced to +10 or higher.
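To choose a process count that matches the hardware, you can add up the "cpu=" entries in the hostfile. The sketch below is in Bourne shell (not the C shell), and the function name "count_cpus" is made up for illustration. It assumes that a host listed without a "cpu=" entry counts as one processor and that lines starting with "#" are comments; check both assumptions against the LAM/MPI documentation for your installation.

```shell
#!/bin/sh
# count_cpus HOSTFILE
# Hypothetical helper: print the total number of processors declared
# in a hostfile of the form "host cpu=N user=NAME".
count_cpus() {
    awk '
        /^[ \t]*#/ { next }    # assumption: "#" starts a comment line
        NF == 0    { next }    # skip blank lines
        {
            n = 1              # assumption: one CPU if no cpu= entry
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/)
                    n = substr($i, 5) + 0
            total += n
        }
        END { print total + 0 }
    ' "$1"
}

# Example use:
#   count_cpus hostfile
```

For the three-machine hostfile in Step 5 this prints 6, which is the value passed to "mpirun -np" in the example above.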
|