A DIY School of Physics Linux cluster

Note: The instructions on this page may be out of date. I haven't used MPI on these machines in some time. [MSW 2/7/07]

The networked Linux workstations in the School of Physics (this link is accessible only within the SoP) can be used by members of the School as a distributed memory parallel computer (a cluster) by following the steps below.

Please note that you should be mindful of the needs of others when you use computers in this way. You should only use CPUs or computers which are not being used already, you should consider your memory and network requirements, and you should "nice" your processes (as explained in Step 8). If you have a large scale job, use Myrmidon.

  1. An implementation of the Message Passing Interface (MPI), namely LAM/MPI, is available on all networked Linux workstations. First put the MPI binaries in your path and set the "LAMHOME" environment variable, e.g. by putting the lines

    set path=(/usr/physics/mpi/bin $path)
    setenv LAMHOME /usr/physics/mpi/

    in your "~/.cshrc" file (assuming a C shell). You also need to put the Intel Fortran 90/95 compiler ifc in your path (see /usr/physics/intel/readme). Basically, you need to run the following command:

    source /usr/physics/intel/intel.csh
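
    If your login shell is a Bourne-style shell (sh or bash) rather than a C shell, the two MPI lines might be written as in the sketch below, placed in "~/.profile" or "~/.bashrc". This is a direct translation of the csh lines above, not something taken from the local setup, and it does not cover the Intel compiler: "intel.csh" is a csh script, so check /usr/physics/intel/readme for a Bourne-shell counterpart.

```shell
# Bourne-shell (sh/bash) counterpart of the two csh lines in Step 1.
# The paths are the ones given on this page.
PATH=/usr/physics/mpi/bin:$PATH   # put the LAM/MPI binaries first in the search path
LAMHOME=/usr/physics/mpi/         # tell LAM/MPI where it is installed
export PATH LAMHOME
```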

  2. Write a Fortran MPI code and compile it using the compiler "hf77", which is a wrapper to the Intel Fortran compiler (ifc). The advantage of using the wrapper is that it specifies the locations of the MPI library and include files, so that you can compile in a simple way, e.g. using

    % hf77 -O3 -o code code.f

    where "code.f" is your Fortran 90 MPI code. Note that despite the name of the wrapper, the underlying compiler ifc is a Fortran 90/95 compiler. If your code is in F90 free source form, use for example

    % hf77 -FR -O3 -o code code.f

    and you can pass any of the other ifc compiler flags.

    You can also use the current setup of LAM/MPI with C codes using the wrapper "hcc", e.g. using

    % hcc -O3 -o code code.c

    assuming "code.c" is your MPI C code. The underlying compiler in this case is gcc.

    If you need an introduction to MPI, the Users' Guides in C and Fortran available at this location are OK. If you are a student, consider enrolling in the second semester unit COSC 3012/3912, Parallel Computing and Visualisation. (If you are a PhD student in the SoP, note that this course could be one of the two courses you are required to complete.)

  3. Put the compiled code somewhere that is visible from all of the Linux workstations you intend to use, for example a directory in your home directory.

  4. Set up password-less ssh to the networked Linux workstations. This is done by first generating an authentication key on the host you intend to run the code from:

    % ssh-keygen -t rsa

    During this procedure you will be prompted for a passphrase. Just hit return at that step (which corresponds to having no passphrase). After you have done this, you need to append the contents of the file "~/.ssh/id_rsa.pub" created by the previous command to the file "~/.ssh/authorized_keys" on one of the networked Linux machines you wish to use. This can be done over ssh. For example, execute:

    % cat ~/.ssh/id_rsa.pub | ssh user@machine2 'cat >> .ssh/authorized_keys'

    where "user" is your username, and "machine2" is one of the other workstations. If the file "~/.ssh/authorized_keys" doesn't exist, you will need to create it.

    You should then try connecting via ssh to each of the workstations you wish to use. The first time you ssh, you will be asked if you want to continue connecting. Type "yes" at this prompt. Subsequent ssh connections should occur without this step (and without requiring a password).
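
    Once you have connected to each workstation by hand at least once, the check can be repeated for all of them in one go. The sketch below is written for a Bourne-style shell (sh or bash), not the C shell used elsewhere on this page. The function name "check_ssh" and the variable "SSH_CMD" are made up for illustration; "BatchMode=yes" is a standard OpenSSH option that makes ssh fail rather than prompt, so a host that would still ask for a password (or to confirm its host key) is reported as a failure.

```shell
# Hypothetical helper: try a non-interactive ssh login to each host named
# on the command line and report which ones still need attention.
# SSH_CMD is an illustrative override; it defaults to the real ssh client.
check_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes: fail instead of prompting for a password.
        # stdin is redirected so ssh cannot swallow the caller's input.
        if ${SSH_CMD:-ssh} -o BatchMode=yes "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: password-less login failed"
            failed=1
        fi
    done
    return $failed
}
```

    After defining the function, a call such as "check_ssh machine2 machine3" prints one line per workstation and returns non-zero if any login failed.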

  5. Write a "hostfile" containing the list of machines you intend to use. An example file might contain the following lines.

    machine1.physics.usyd.edu.au cpu=2 user=wheat
    machine2.physics.usyd.edu.au cpu=2 user=wheat
    machine3.physics.usyd.edu.au cpu=2 user=wheat

    The computer machine1.physics.usyd.edu.au should be the host you intend to launch the code from (the computer you are logged into). This file specifies that the code should also be run on the nodes "machine2" and "machine3". Additionally, it is specified that each computer has two processors.

    You may need to replace the machine names with their IP addresses. For example, machine1.physics.usyd.edu.au may need to be replaced by its IP address, say 129.78.129.120. IP addresses can be obtained at the command line using nslookup.
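
    For more than a few machines it may be easier to generate the hostfile than to type it. The following sketch, for a Bourne-style shell (sh or bash), prints one line per machine in the format shown above. The function name "make_hostfile" and the variables "CPUS" and "LAMUSER" are invented for this example and are not LAM/MPI settings; the sketch assumes every machine has the same number of processors and the same username.

```shell
# Hypothetical helper: print one hostfile line per short machine name.
# CPUS (default 2) and LAMUSER (default: your current username) are
# illustrative variables, not part of LAM/MPI.
make_hostfile() {
    for host in "$@"; do
        echo "$host.physics.usyd.edu.au cpu=${CPUS:-2} user=${LAMUSER:-$USER}"
    done
}
```

    For example, "make_hostfile machine1 machine2 machine3 > hostfile" writes a three-line file. List the machine you are logged into first.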

  6. Tell LAM/MPI to use ssh as the method of executing commands remotely, and test the possibility of booting LAM/MPI:

    % setenv LAMRSH "ssh -x"
    % recon -d hostfile

    (this assumes the C shell). [I think the ssh option is now set automatically, so the first command may not be necessary.] If the second command fails with a message about not being able to find the hosts in the hostfile, try replacing the hostnames with their corresponding IP addresses, as explained in Step 5.
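
    In a Bourne-style shell (sh or bash) "setenv" is not available, and the variable is exported instead. The sketch below wraps the two commands of this step in a small function; the name "lam_check" is invented for this example, and the function only reports whether recon returned success.

```shell
# Hypothetical helper (not part of LAM/MPI): set LAMRSH the Bourne-shell
# way, then run recon on the given hostfile (default name: hostfile).
lam_check() {
    hostfile=${1:-hostfile}
    LAMRSH="ssh -x"
    export LAMRSH
    if recon -d "$hostfile"; then
        echo "recon succeeded for $hostfile"
    else
        echo "recon failed: try IP addresses in $hostfile (see Step 5)" >&2
        return 1
    fi
}
```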

  7. Boot LAM/MPI:

    % lamboot -v hostfile

    [This may require a few tries if password-less ssh has just been set up (Step 4).]

    You can check that you have the expected set of nodes by typing

    % lamnodes
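
    Since booting may take a few tries, the attempts can be scripted. The sketch below is for a Bourne-style shell (sh or bash). The function name "boot_lam" and the variable "RETRY_DELAY" are invented for this example. It simply reruns lamboot until it reports success, then lists the nodes; whether a failed attempt leaves anything behind that needs cleaning up first is not something this sketch checks.

```shell
# Hypothetical helper: run lamboot up to a given number of times
# (default 3) and list the nodes once it succeeds.
# RETRY_DELAY is an illustrative variable: seconds to wait between tries.
boot_lam() {
    hostfile=${1:-hostfile}
    tries=${2:-3}
    i=1
    while [ "$i" -le "$tries" ]; do
        if lamboot -v "$hostfile"; then
            lamnodes          # show the nodes that were booted
            return 0
        fi
        echo "lamboot attempt $i of $tries failed" >&2
        i=$((i + 1))
        sleep "${RETRY_DELAY:-2}"
    done
    return 1
}
```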

  8. Run your code with the desired number of processes (which may be more than, equal to, or fewer than the total number of processors on all machines). For example, for six processes:

    % mpirun -np 6 nice 10 code

    where the name of the executable is "code". Note that this example follows the recommendation that distributed jobs be niced to +10 or higher.
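
    To pick a process count that matches the hostfile, the "cpu=" entries can be added up. The sketch below is for a Bourne-style shell (sh or bash) with awk. The function name "count_cpus" is invented for this example, and it assumes that a host listed without "cpu=" counts as one processor and that lines starting with "#" are comments.

```shell
# Hypothetical helper: sum the cpu=N entries in a hostfile.
# Assumptions: a line without cpu= counts as 1; lines starting with '#'
# and blank lines are ignored.
count_cpus() {
    awk '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        {
            n = 1                        # default: one processor per host
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/) n = substr($i, 5) + 0
            total += n
        }
        END { print total + 0 }
    ' "$1"
}
```

    For the example hostfile in Step 5 (three machines with cpu=2 each), "count_cpus hostfile" prints 6, the value passed to "-np" above.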


Page maintained by m.wheatland@physics.usyd.edu.au Page last updated Thursday, 02-Aug-2007 10:56:55 EST