UvA HPC course 2021-01-22
Extras - Principal Component Analysis (PCA)

This is an exercise from the Extras part of the Tutorial UvA HPC course 2021-01-22.

In this advanced part of our HPC Cloud tutorial you will run an exercise that contrasts the scale-up and scale-out scenarios. You will use Principal Component Analysis (PCA) to study flight delays; the data can be analysed either by scaling up a single VM or by scaling out across multiple VMs.

The original dataset comes from here, but we have already prepared some files for you. Among other preparation steps, we selected a range of dates and a subset of variables, and cleaned the data to make it more useful.

NOTE:

You are now in the advanced section of the workshop. You have your laptop and an Internet connection. We expect you to find out more on your own about things that we explain only briefly or not at all, but which you think you need. For example, if we were you, at this point we would have already googled several things:

  1. Principal Component Analysis
  2. R language
  3. R modules

a) Setting up a VM for the exercise

Start a new single core VM with 1 GB memory (you are now in the advanced part; you should be able to do this on your own). The steps in this exercise assume that you are using an Ubuntu image.

b) Prepare the VM for data analysis

In this part of the exercise you will prepare the software and download the data for analysis. After logging into the VM, run:

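# Refresh the package index, install R, and confirm the installed version
sudo apt-get update
sudo apt-get install r-base
R --version
# Download and unpack the prepared exercise files
wget http://doc.hpccloud.surfsara.nl/UvA-20210122/code/airplane-delay.tar
tar -xvf airplane-delay.tar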

Food for brain:

  • What version of R do you have?
  • How can you inspect the files without opening them?
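
As a starting point for the second question, here is a minimal sketch using standard command-line tools. The function name `peek_file` is our own invention for illustration, and the file name in the example is a placeholder, not one of the actual files in the archive.

```shell
#!/usr/bin/env bash
# Peek at a data file without opening it in an editor.
# Usage: peek_file <path> [number_of_lines]
# (peek_file is a hypothetical helper name, not part of the course material.)
peek_file() {
    local file="$1"
    local lines="${2:-5}"
    # Size on disk in human-readable units
    ls -lh -- "$file"
    # Number of lines (header plus records, for a CSV)
    wc -l < "$file"
    # First few lines, usually enough to see the header and layout
    head -n "$lines" -- "$file"
}

# Example with a placeholder file name:
# peek_file ~/airplane-delay/some-file.csv 3
```

To see which files the archive contains without extracting it again, `tar -tvf airplane-delay.tar` lists its contents.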

c) Run the Principal Component Analysis

cd ~/airplane-delay
Rscript airplane-delay-all-comp.r

You just ran an R script and saw the output. What do these numbers mean? Which variables (columns) were used to perform the PCA?
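
One way to approach the second question is to compare the columns available in the data with the columns the R script selects. The sketch below lists the header fields of a CSV file; it assumes a comma-separated header on the first line with no quoted commas, and `list_columns` is a name we made up for illustration.

```shell
#!/usr/bin/env bash
# List the column names of a CSV file, one per line, with their position.
# Assumption: the first line is a comma-separated header without quoted commas.
# (list_columns is a hypothetical helper name.)
list_columns() {
    head -n 1 -- "$1" | tr ',' '\n' | nl
}

# Example with a placeholder file name:
# list_columns ~/airplane-delay/some-file.csv
```

Reading the R script itself (for example with `less airplane-delay-all-comp.r`) then shows which of these columns are actually passed to the PCA.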

Rscript airplane-delay-plots.r

For simplicity, we plot only part of the data. You may use all the data points to create the plots.

Food for brain:

  • How can you display these plots? (Hint: you can log in with X11 forwarding enabled.)
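
To follow the hint, you could log in again with something like `ssh -X user@vm-address` (both `user` and `vm-address` are placeholders for your own login details). The small sketch below then checks, from inside the VM, whether a display is available to graphical programs; `check_x11` is a name we chose for illustration.

```shell
#!/usr/bin/env bash
# Report whether an X11 display is available in the current session.
# When you log in with X11 forwarding, the SSH server sets DISPLAY
# (typically to something like "localhost:10.0").
# (check_x11 is a hypothetical helper name.)
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set"
        return 1
    fi
}

# Example:
# check_x11
```

If `DISPLAY` is not set, graphical viewers started on the VM will not be able to open a window on your laptop.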

Rscript airplane-delay-some-comp.r

How do these numbers compare to the previous analysis with the full dataset? A similar example, which can be found here, may help you interpret and further analyse the results.

Bonus food for brain: Scaling up or scaling out?

So far you have worked on a single dataset on a single-core VM with 1 GB of memory. Two datasets are provided in the airplane-delay.tar file, and data files for a few more months are available here (delay-2018-*.csv). How would you run the analysis for the year as a whole?
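
For the scale-up route, one possible first step is to combine the monthly files into a single file. The sketch below assumes that every monthly file starts with the same single header line; `merge_csv` and the output name `year-2018.csv` are our own choices for illustration, and whether the provided R scripts can read a different input file without modification is something you would need to check.

```shell
#!/usr/bin/env bash
# Concatenate several CSV files into one, keeping only the first header.
# Assumption: every input file starts with the same single header line.
# Usage: merge_csv <output> <input>...
# (merge_csv is a hypothetical helper name.)
merge_csv() {
    local out="$1"
    shift
    # Take the header once, from the first input file
    head -n 1 -- "$1" > "$out"
    local f
    for f in "$@"; do
        # Append everything except the header line
        tail -n +2 -- "$f" >> "$out"
    done
}

# Example: the output name must not match the input pattern,
# otherwise a second run would merge the output into itself.
# merge_csv year-2018.csv delay-2018-*.csv
```

The merged file is larger than any single month, so the 1 GB VM may no longer be sufficient. For the scale-out route, keep in mind that a PCA computed separately per month is not the same as a PCA over the whole year, so you need to think about how the per-VM results would be combined.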