This is an exercise from the Extras part of the HPC Cloud Tutorial.
In this advanced part of our HPC Cloud tutorial we ask you to play around with a parallel processing technique on a message-passing system. For this purpose, we will be running wave simulations using MPI. We will approximate solutions for the wave differential equation in 2D by using numerical methods.
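For reference, the equation being approximated is usually written as follows (in LaTeX notation, with u the displacement of the surface, c the propagation speed and zeta a damping coefficient; the exact form and boundary conditions used in the provided code may differ slightly):

\frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) - \zeta \frac{\partial u}{\partial t}

A numerical method replaces the derivatives with finite differences on a grid of points and steps the solution forward in small time increments.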
NOTE:
You are now in the advanced section of the workshop. You have your laptop and an Internet connection. We expect you will be able to find out more on your own about things that we hardly explain (or don't explain at all) but which you think you need. For example, if we were you, at this point we would've already googled for several things:
- Numerical methods
- Wave differential equation
- MPI cheatsheet
We provide you with an implementation of that simulation using MPI. You will be asked to perform multiple runs of each program, so that fluctuations caused by e.g. the network can be averaged out.
Tip:
We recommend you have a look at the end of this page for some hints on measuring time, running a program multiple times, and computing an average time out of multiple measurements.
We will be creating a 2-core VM for this exercise.
- Create a new image and template from the app. Make sure you select the proper datastore for your new OS image (147:Courses_img). Give the image and the template the name: mpi_wave
- Edit the image to make it persistent.
- Edit the template to make it a 2-core one: select the template and click anywhere on its row (except the checkbox) to show its extended information, then click the Update button.
- Add a second nic (hit the + Add another interface button); for this second nic, NIC 1, choose the wolk-surfsara.int network (from the table on the right of the screen).
- Launch a VM from that template and log into it with SSH:
ssh -Y ubuntu@145.100...
sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) main universe"
sudo apt-get update
sudo apt-get install build-essential
Optionally verify the gcc and GNU make installation and versions with gcc -v and make -v, respectively.
sudo apt-get install libhdf5-serial-dev libopenmpi-dev openmpi-bin openmpi-common hdf5-tools imagemagick gnuplot
echo 'export CPATH=/usr/include/hdf5/serial/' >> ~/.bashrc
source ~/.bashrc
wget https://raw.githubusercontent.com/sara-nl/clouddocs/gh-pages/tutorial/code/policy.xml
sudo mv policy.xml /etc/ImageMagick-6/policy.xml
wget https://github.com/sara-nl/clouddocs/raw/gh-pages/tutorial/code/waveeq2.tar.gz
tar -zxf waveeq2.tar.gz
cd waveeq/
ls -l
make wave4
The code in file wave4.c is where the main routine is. It uses functionality from other .c and .h files, but the main loop is there. The main loop understands how many MPI processes must be created and it divides the space among them to distribute the work.
When the program runs, it writes output to file wave4.h5 in HDF5 format.
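If you are curious about what is inside that output file, the hdf5-tools package installed earlier lets you inspect it from the command line; for example (a quick sketch, assuming a run of wave4 has already produced the file):

h5ls wave4.h5        # list the datasets stored in the file
h5dump -H wave4.h5   # show only the header/metadata, without dumping the data itself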
The following parameters in the code control the simulation:
- c is the wave propagation speed
- zeta is the damping coefficient
- m and n are the spatial dimensions; you can play with these to make a bigger/smaller surface
- nt is the temporal dimension; you can play with this to make a longer/shorter time simulation
- nt_i is the checkpoint interval: output will be written to file every nt_i time iterations.

We invite you to explore the code to get familiar with it. If you change any of these values, please remember to compile the program again for the changes to take effect.
You can run the program in a single process with the following command:
time ./wave4
You can use the provided program h5anim to read the output file of the wave4 program (remember, it is called wave4.h5) and create an animated wave4-anim.gif. You do it like this:
./h5anim wave4.h5 u
Then you can use any browser to view the animated gif (e.g. with the command firefox wave4-anim.gif), or install the program gifview (with sudo apt-get install gifsicle) and view it like this: gifview --animate wave4-anim.gif.
Food for brain c1:
- Can you make a batch of several runs and observe the performance?
We will want to show how to scale out later, and that will involve multiple VMs, as explained during the presentation. In order for multiple VMs to be able to run from the same image, the image must be non-persistent (as explained in Part B).
While we are at it, generate an SSH key pair for the ubuntu user and authorise it, so that VMs started from this image will later be able to SSH into each other without a password:
ssh-keygen -t rsa -f /home/ubuntu/.ssh/id_rsa
#Enter passphrase (empty for no passphrase): <<<<leave this empty (hit enter twice)>>>>
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
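Optionally, you can check that the key works by SSH-ing to the machine itself; it should not ask for a password (a quick sketch):

ssh -o StrictHostKeyChecking=no localhost hostname   # should print the VM's hostname without prompting for a password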
Previously we made the image persistent so that, when shutting the VM down, changes would be saved and kept for the next run. But changes are only saved when you actually shut the VM down gracefully.
So shut the VM down gracefully, then edit the image and switch the value of the field Persistent to no.
IMPORTANT
Now that the image is non-persistent, no changes will be saved when you shut down a VM that uses it. If you require that at some point, you will have to make it persistent again first!
Because the program is ready for MPI, you can use mpirun to use multiple cores.
Exercise e1: Try now the MPI run with 2 cores, like this:
- Launch a VM from the template, giving it the name e.g. mpi_wave_master
- Log into it with SSH (remember the -X parameter for the ssh command)
- Go to the waveeq directory: cd waveeq
You can now run the program with 2 processes like:
time mpirun -n 2 ./wave4
Food for brain e1:
- Is the output image from this multi-process the same as the single-core one?
- How many processes are running? (hint: use the top command on a different terminal, or see the sketch right after this list)
- Do you see any significant time improvement as compared to running it with one process? Can you explain the improvement (or lack thereof)?
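For the process-counting hint above, instead of (or besides) top you could use a one-shot check like this (a sketch; it assumes the processes are still named wave4):

pgrep -c wave4                     # count the wave4 processes currently running
ps -C wave4 -o pid,psr,pcpu,cmd    # list them, with the CPU core (psr) each one is running on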
Exercise e2: You can try to run now the program with more processes. For example, with 4:
time mpirun -n 4 ./wave4
Food for brain e2:
- Is the output image from this multi-process the same as that from previous runs?
- Can you make a batch of several runs (e.g.: 20) and calculate the average runtime and standard deviation?
- How many processes are running? (hint: use the top command on a different terminal)
- Do you see any significant time improvement as compared to the previous runs? Can you explain the improvement (or lack thereof)?
Exercise e3: Try now running the program with a couple more configurations, like 6 or 8 processes. Any improvement in time?
MPI is able to communicate between processes that may physically be running on different (virtual) machines. We are going to make this happen now.
There is a fair amount of configuration that needs to happen among all the machines involved in cooperating for running MPI jobs. We have prepared a couple of scripts that you can use for that.
A typical way of considering a cluster is to have a master node (or host) where you can externally log into and launch the programs, along with a set of worker nodes that the master knows about and where it delegates computing workload.
In our exercises, we will consider one master and one worker node. Both will compute (so the master will not be just a passive node, but will also contribute to the output). We will not consider any job-submitting queues; rather, we will let MPI communicate over SSH. For that, both the master and the worker need to be able to SSH to each other without requiring a password (a.k.a. passwordless ssh). We provide you with a script that you can run in each of the machines (in this case just one master and one worker) to do this interactively in a, hopefully, easy way.
To make it easier, all nodes where MPI will run a program must have that program installed the same way in the same path. Because we (you) have been carefully building the image so far, that is already arranged.
Also, usually the worker nodes are protected (inaccessible) from the outside world, so you can only reach them normally from within the internal network. We will simulate this as well. We have provided a script to configure the master and another for the worker node, which will change the hostname and also bring down the external network interface on the worker node.
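To give you an idea of what that last part means in practice, "change the hostname and bring down the external network interface" boils down to commands of roughly this shape (only an illustration; the actual makeme_worker.sh script may do more, or do it differently):

sudo hostnamectl set-hostname worker1   # example hostname, not necessarily the one the script uses
sudo ip link set eth0 down              # disable the external interface, leaving only the internal one (eth1) up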
Exercise f1: Launch another VM that will become a worker
- Launch a VM from the template, giving it the name e.g. mpi_wave_worker
Exercise f2: Configure the master node
Make sure you have an SSH connection to the VM we have been playing with so far (so, not the one you just created). This will become the master of our MPI cluster from now on.
Write down the internal IP address of the master node.
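The internal address is the one on the second interface (eth1, on the wolk-surfsara.int network). You can look it up, for example, with:

ip -4 addr show eth1   # the internal IP is the address shown after "inet"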
Run our configuration script to turn the “mpi_wave_master” VM into the master:
cd
wget https://github.com/sara-nl/clouddocs/raw/gh-pages/tutorial/code/makeme_master.sh
chmod +x makeme_master.sh
sudo ./makeme_master.sh
exit
Food for brain f2:
- What has just happened?
Exercise f3: Configure the worker node
Let’s start by giving root a password on the worker node, so that we can use the VNC console.
Log in to the worker node “mpi_wave_worker” VM. Remember to use the -X parameter for the ssh command.
Run the commands to give a root password (type the passwords when you are asked to):
sudo su -
passwd
exit
IMPORTANT!
- Do not go any further until you are sure that you can log in as root on the worker node via the VNC console. In the Desktop version select “Other” and enter “root” as username and the password you have just set. Ignore the error on the initial screen (hit OK).
cd
wget https://github.com/sara-nl/clouddocs/raw/gh-pages/tutorial/code/makeme_worker.sh
chmod +x makeme_worker.sh
sudo ./makeme_worker.sh 1 XXX.YYY.ZZZ.TTT #replace XXX.YYY.ZZZ.TTT with the INTERNAL IP address of the master
#hit ENTER when prompted and wait a bit..
Food for brain f3:
- Why do we recommend you to use the VNC console on this VM?
- What has just happened!? Why do you need to become root? Why does the script require those parameters?
The Apps that we deliver come with a firewall running on the operating system, called Uncomplicated Firewall (or ufw, in short). MPI needs to communicate through the network between master and worker, and both are running this firewall. To avoid problems, and because this is just a test scenario, we will trust all traffic coming in on our internal interfaces. Run the following command on both the master and the worker:
sudo ufw allow in on eth1 && sudo service ufw restart
sudo ufw allow in on eth1 && sudo service ufw restart
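If you want to verify that the rule is in place on each machine, you can list the firewall configuration:

sudo ufw status verbose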
Exercise f4: Run over 2 VMs
time mpirun -np 4 -H <master_INTERNAL_ip>,<worker_INTERNAL_ip> /home/ubuntu/waveeq/wave4
Food for brain f4:
- Is the output image from this multi-process the same as that from previous runs?
- How many processes are running? (hint: use the top command on different terminals)
- Do you see any significant time improvement as compared to the previous runs? Can you explain the improvement (or lack thereof)?
This section contains extra questions that we thought would be nice for you to investigate; we invite you to work on (or think about) them even after the workshop is finished.
Bonus 1: Can you make a batch of several runs (e.g.: 20) and calculate the average runtime and standard deviation?
Bonus 2: What do you need to do to make more workers available? Is our image enough? Go ahead: try to have 2 more workers of 1 core each. Then run the program among them. Does the run time reduce? And if you have 2 workers, 2 cores each? And 3 workers, 2 cores each? And… Is it worth parallelising a lot? Where is the optimum?
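As a starting point for those experiments, a run over, say, one master and two workers with 2 cores each could look something like the sketch below (the IPs are placeholders; listing a host several times in -H gives it that many slots):

time mpirun -np 6 -H <master_ip>,<master_ip>,<worker1_ip>,<worker1_ip>,<worker2_ip>,<worker2_ip> /home/ubuntu/waveeq/wave4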
Bonus 3: It can become a problem when you have to copy and install your program and data “everywhere” in the same place. This can be alleviated by sharing your /home folder via NFS. Can you set that up?
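A rough sketch of how that could look, assuming the master exports its home directory over the internal network (adapt the network range and addresses to your own setup):

# on the master: install an NFS server and export the home directory to the internal network
sudo apt-get install nfs-kernel-server
echo '/home/ubuntu 10.0.0.0/8(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports   # replace 10.0.0.0/8 with your internal range
sudo exportfs -ra
# on each worker: install the NFS client and mount the master's home directory
sudo apt-get install nfs-common
sudo mount <master_INTERNAL_ip>:/home/ubuntu /home/ubuntu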
Bonus 4: Having to download the source code of a tool you need, compile it yourself and install it, is a very common workflow. Do you have a tool in this situation? Can it benefit from MPI? Please, let us know. Can you successfully get it running? Can you parallelise it?
Bonus 5: Using SSH might be a way to get along, but it becomes cumbersome when you have multiple things to run at a time, users’ access to ensure, passwordless permissions to maintain… There exist cluster-building tools based on job queues, like Sun (now Oracle) Grid Engine, Torque, etc. Can you find out more? Can you set one up?
Bonus 6: MPI is an implementation of a technique for parallelising computations. Another common technique is shared memory. One implementation for that technique is OpenMP. You can read more about it at their website: http://openmp.org/wp/.
NOTE: Do not forget to shutdown your VMs when you are done with your performance tests.
If you want more of the advanced exercises on the HPC Cloud, see Extras.
To help with measuring times, this section gives you some hints.
You can use the time facility from bash to measure the time it takes to run a command, like: time command. It will return 3 lines via stderr: the overall time and the subparts for user time and system time.
The plain format for the output of time is not very helpful for operating with the values; you can redefine it with the environment variable TIMEFORMAT. The value %R asks time to return the number of seconds and milliseconds; that is much easier to operate with.
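For example, reducing the format to just the elapsed seconds:

TIMEFORMAT='%R'   # report only the elapsed time, in seconds
time ./wave4      # now prints a single number on its own line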
You can compute an average and standard deviation with awk, using a one-liner like:
awk '{sum += $1; sumsq+=$1*$1 } END {print " avg " (sum/NR) "\nstddev " sqrt(sumsq/NR - (sum/NR) * (sum/NR))}'
Given a file my_nums.txt containing one number per line, you can compute the average and standard deviation for those numbers like:
cat my_nums.txt | awk '{sum += $1; sumsq+=$1*$1 } END {print " avg " (sum/NR) "\nstddev " sqrt(sumsq/NR - (sum/NR) * (sum/NR))}'
You can run and measure the same command a given number of times by using a for-loop, like: for run in {1..3}; do time COMMAND_HERE; done 2>&1
For example, a complete way to run the wave4 program 20 times with 2 processes and then compute the average running time, using an intermediate output-collection file (while, at the same time, showing progress on screen), could be:
TIMEFORMAT=$'elapsed %R' bash -c 'for i in {1..20}; do echo [iter $i]; time mpirun -n 2 ./wave4; done 2>&1 | tee times_n2.txt'
cat times_n2.txt | grep ^elapsed | sed -e s/.*\ // | awk '{sum += $1; sumsq+=$1*$1 } END {print " avg " (sum/NR) "\nstddev " sqrt(sumsq/NR - (sum/NR) * (sum/NR))}'
An explanation on the previous code listing:
The first line runs the command in a loop and collects all output into the times_n2.txt file. Every iteration, the time command outputs a line like elapsed ....
The second line reads the times_n2.txt file; it filters all lines to take only those that start with elapsed. For each of these lines, sed takes only the number after the space. Finally, awk processes all these numbers and, when it is done, prints out the average and standard deviation.