High-Performance Computing Survival Guide James R. Knight Yale Center for Genome Analysis Department of Genetics Yale University January 14, 2015


High-Performance Computing Survival Guide

James R. Knight

Yale Center for Genome Analysis

Department of Genetics

Yale University

January 14, 2015

1950’s – The Beginning...

2015 – Looking very similar...

...but there are differences

• Not a single computer but thousands of them, called a cluster
  – Hundreds of physical “computers”, called nodes
  – Each with 4-64 CPUs, called cores

• Nobody works in the server rooms anymore
  – IT is there to fix what breaks, not to run computations (or help you run computations)
  – Everything is done by remote connections

• Computation is performed by submitting jobs for running
  – This actually hasn’t changed...but how you run jobs has...

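A Compute Cluster

[Diagram: you (“You are here!”) reach louise.hpc.yale.edu over the network. The cluster contains a login node (Login-0-1) and compute nodes (Compute-1-1, Compute-1-2, Compute-2-1, Compute-2-2, Compute-3-1, Compute-3-2).]

• 300+ users.
• 90 compute nodes for general use.
• 300TB disk space.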

You Use a Compute Cluster! Surfing the Web

[Diagram: you (“You are here!”) click on a link; the request travels over the network to the compute nodes of Blah.com, which construct the webpage contents and return the webpage.]

How you’ll be using Louise

[Diagram: you (“You are here!”) connect by ssh to the login node (Login-0-1) of louise.hpc.yale.edu, then connect by qsub -I to a compute node (Compute-1-1 through Compute-3-2). Run commands on compute nodes (and submit qsub jobs to the rest of the cluster).]

• 300+ users.
• 90 compute nodes for general use.
• 300TB disk space.

1970’s – Terminals, In the Beginning...

2015 – Pretty much the same...

• Terminal app on Mac

• Look in the “Other” folder in Launchpad

Your “New” User Interface – Hunt and Peck!

• Type a command at the prompt, hit the return key

program arguments...

• This runs the program, which will read the arguments, read inputs, perform computations and produce outputs

• When it completes, the prompt is displayed, telling you it is ready for the next command

• Key commands to learn:
  – ssh [email protected]
  – qsub -I


Helpful Tips

• Take a Linux basics tutorial

• The faster you can type, the faster you will be done

• Select and learn a text editor
  – Vi or Emacs

• Select and learn a programming language
  – Perl, Python or R

• Ask these questions to keep you oriented
  – What computer am I on?
  – What directory am I in?
  – Where are the files for my analysis?
  – What program(s) do I have running?
  – What jobs do I have running?

Directories and Paths

• Linux directory structure same as Mac/Windows folder structure
  – Folders/directories containing files and other sub-folders/sub-dirs
  – “Easy-to-access” directories: HOME directory

• A path is a string naming a file or directory in the structure
  – The slash character (‘/’) is separator for directories

/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx
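A path like the one above can be taken apart with standard shell tools. The following is a minimal sketch, assuming only the standard `dirname` and `basename` utilities; the variable names `path`, `dir` and `file` are made up for illustration.

```shell
# Split the example path into its directory part and its file name.
path="/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx"

dir=$(dirname "$path")     # everything before the last slash
file=$(basename "$path")   # everything after the last slash

echo "Directory: $dir"
echo "File:      $file"
```

Quoting `"$path"` keeps the path in one piece even if a directory name contains spaces.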

The Shell

• When you type commands and run programs, you are actually running a program called a shell
  – Designed to take user input, run programs and display output
  – Started automatically when Terminal app started or when you log into a computer
  – Linux runs the bash shell, by default

• Maintains useful environment variables
  – $PWD, which holds your current working directory path
  – $HOME or ~, which holds your home directory path
  – $PATH, which holds locations of programs

• Powerful tool for organizing and executing commands
  – Useful to combine programs or redirect inputs and outputs, without having to write a program to do that
  – Full-fledged programming language, used to write shell scripts to run sets of commands
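To see what the shell keeps in these variables, they can simply be printed. This is a small sketch rather than anything specific to Louise; the output depends on the computer and directory you run it from.

```shell
# Print the environment variables the shell maintains for you.
echo "Current directory: $PWD"
echo "Home directory:    $HOME"

# $PATH is a colon-separated list of directories searched for programs.
# Replacing each colon with a newline lists one directory per line.
echo "$PATH" | tr ':' '\n'

# 'command -v' reports which program a command name resolves to.
command -v ls
```

When a command is "not found", checking `$PATH` this way is usually the first thing to try.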

The Program’s Viewpoint

• Programs start knowing nothing, and must figure out what to do
  – Lines of code are generalized instructions
  – Specifics come from reading the program’s environment

[Diagram: The Program in the center. Inputs: Standard Input (keyboard), Command-line Arguments (what you typed), Files to read. Outputs: Standard Output (screen), Standard Error (screen), Files to write.]
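The channels named in the diagram above can be seen in a very small "program" written as a shell function. This is an illustrative sketch only: the function name `shout` and its behavior are made up, and it touches command-line arguments, standard input, standard output and standard error (but not files).

```shell
# A tiny "program" written as a shell function, showing each channel.
shout() {
    # Command-line arguments: what you typed after the program name.
    if [ "$#" -eq 0 ]; then
        # Standard error: where problems are reported.
        echo "shout: no prefix given" >&2
        return 1
    fi
    prefix=$1

    # Standard input: read line by line. Standard output: write results.
    while IFS= read -r line; do
        echo "$prefix $line" | tr '[:lower:]' '[:upper:]'
    done
}

# Here standard input comes from a pipe instead of the keyboard.
echo "hello from the cluster" | shout "node says:"
```

The function has no idea in advance what prefix or text it will get; both specifics come from its environment, which is the point of the slide.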

Shell Redirection, Piping and Multiple Commands

• The shell lets you redirect stdin, stdout and stderr to configure how your program communicates

• myprog < inFile > outFile 2> errFile
  – “< inFile” redirects stdin so that program reads contents of “inFile”
  – “> outFile” redirects stdout so that program writes standard output to “outFile”
  – “2> errFile” redirects stderr so that program writes standard error to “errFile”

• echo Hello | sed s/Hello/Goodbye/
  – The “|” (called a pipe) redirects the echo program’s standard output so that it writes to the standard input of the sed program
  – This command writes “Goodbye” to the screen

• echo Hello ; echo Goodbye
  – The semi-colon separates commands, allowing multiple programs to run from one command-line
  – This command writes “Hello” then “Goodbye” to the screen
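The three forms above can be tried safely in a throwaway directory. In this sketch, `sed` stands in for “myprog”, and the directory created by `mktemp -d` and the file contents are made up for the demonstration.

```shell
# Work with files in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)

# Create an input file with two lines.
printf 'Hello\nHello again\n' > "$workdir/inFile"

# stdin comes from inFile, stdout goes to outFile, stderr goes to errFile.
sed s/Hello/Goodbye/ < "$workdir/inFile" > "$workdir/outFile" 2> "$workdir/errFile"
cat "$workdir/outFile"

# A pipe: sed's standard output becomes wc's standard input.
sed s/Hello/Goodbye/ < "$workdir/inFile" | wc -l

# Two commands on one line; the failing ls writes only to errFile.
ls "$workdir/no_such_file" 2> "$workdir/errFile" ; echo "still running"
```

Looking at `errFile` afterwards shows the error message from `ls`, which never appeared on the screen.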

Writing Scripts

• Sometimes Linux’s built-in programs, and existing bioinformatics programs, are not enough
  – To combine programs together in a specific way
  – To run programs on many different files/datasets
  – To perform custom statistical analyses on data files

• Scripting languages make it easy to write your own programs
  – bash, perl, python, R
  – Write the lines of the script using a text editor
  – Use the language’s program to run the script

    perl myscript arguments...

Then, test, debug and rewrite...

Writing Scripts

• A script is like a lab protocol
  – Instructions on how to perform a task
  – Executed in order, from beginning to end
  – Just as protocol steps can have sub-steps, repeated steps and sub-protocols, script statements can have sub-statements, loops and function calls

• Types of statements in a script
  – Computation (assignment, input/output), if-then-else, for and while loops, functions

• Each programming language has its own unique syntax that you must follow

REMEMBER: You are the protocol writer... ...writing for someone very, very, very stupid
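Each type of statement listed above appears in the short shell sketch below. The function name `classify`, the threshold of 30 and the quality values are made up for illustration; other languages express the same ideas with different syntax.

```shell
# Function: a named sub-protocol that can be called by name.
classify() {
    # If-then-else: choose a step based on a test.
    if [ "$1" -ge 30 ]; then
        echo "pass"
    else
        echo "fail"
    fi
}

# Assignment: store a value in a variable.
total=0

# For loop: repeat the same steps for each value in a list.
for quality in 12 35 40; do
    echo "quality $quality: $(classify "$quality")"
    total=$((total + quality))
done

# While loop: repeat steps as long as a condition holds.
countdown=3
while [ "$countdown" -gt 0 ]; do
    echo "countdown $countdown"
    countdown=$((countdown - 1))
done

echo "total quality: $total"
```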

Writing Scripts

• Instead of reagents, tubes and plates, scripts operate on values, variables, data structures and files
  – Values: numbers (1, 2, 87.5), strings (“I am a string!”)
  – Variables: holder for a value
  – Data structures: holder for collections of values
  – Files: Series of strings (text files) or numbers (binary files) stored on disk

• Important data structures:
  – List or Array – ordered collection of values [ 1, 2, 4, 3 ]
  – Hash or Dictionary – collection of “name, value” pairs, like a telephone book
  – Record or Struct – collection of named variables/data-structures
  – Matrix – two-dimensional collection of numbers
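Values, variables, lists and dictionaries look like this in the bash shell. This is a sketch that assumes bash version 4 or later (needed for `declare -A`); the names and telephone numbers are invented, and records and matrices are not shown because bash has no direct equivalent.

```shell
# Values and variables: a number and a string, each held in a variable.
read_length=87.5
greeting="I am a string!"
echo "$read_length / $greeting"

# List or array: an ordered collection of values (bash counts from 0).
depths=(1 2 4 3)
echo "first value: ${depths[0]}, number of values: ${#depths[@]}"

# Hash or dictionary: name-value pairs, like a telephone book.
# 'declare -A' requires bash 4 or later.
declare -A phone
phone[alice]="555-0100"
phone[bob]="555-0101"
echo "alice's number: ${phone[alice]}"

# Loop over the names (keys); their order is not guaranteed.
for name in "${!phone[@]}"; do
    echo "$name -> ${phone[$name]}"
done
```

Perl, Python and R all provide the same structures with their own syntax, and usually with better support for records and matrices.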

That’s fine, but how do you do this, really???

• My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  – Make the step descriptions comments in the script
    • Comments are lines beginning with ‘#’, which are ignored when executing the script
  – Refine into sub-steps when translation is difficult

• Example: writing echo in Perl
  – Echo takes the command-line arguments and writes them to standard output

    [jk2269@compute-7-2 ~]$ echo Hello from the cluster!
    Hello from the cluster!
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???

• Attempt #1: Implement that description
  – Perl has a @ARGV list with the command-line arguments
  – Perl has a print statement to write to standard output

• Program:

    #
    # Write the command-line arguments to stdout.
    #
    print @ARGV;

    [jk2269@compute-7-2 ~]$ perl myecho.pl Hello from the cluster!
    Hellofromthecluster![jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???

• Attempt #2: Refine, write each argument separately, so that the output can be formatted better.
  – Perl can loop over the values of the @ARGV list
  – The print statement can write string values like " " (a space)

• Program:

    #
    # 1. for each command-line argument,
    #    a. write the argument
    #    b. write a space
    #
    for my $arg (@ARGV) {
        print $arg;
        print " ";
    }

    [jk2269@compute-7-2 ~]$ perl myecho.pl Hello from the cluster!
    Hello from the cluster! [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???

• Attempt #3: Fix where the prompt is shown. (Ignore the extra space.)
  – Printing a special "\n" string value outputs a newline character

• Program:

    #
    # 1. for each command-line argument,
    #    a. write the argument
    #    b. write a space
    # 2. write a newline
    #
    for my $arg (@ARGV) {
        print $arg;
        print " ";
    }
    print "\n";

    [jk2269@compute-7-2 ~]$ perl myecho.pl Hello from the cluster!
    Hello from the cluster! 
    [jk2269@compute-7-2 ~]$

That’s fine, but how do you do this, really???

• Attempt #4: Try a different approach, construct the string to be output, then print it.
  – Perl has a join function that combines a list of strings into a string, and can include a separator.

• Program:

    #
    # 1. Combine the command-line arguments into a
    #    string, separating them by spaces
    # 2. Write that string
    # 3. write a newline
    #
    my $line = join(" ", @ARGV);
    print $line;
    print "\n";

    [jk2269@compute-7-2 ~]$ perl myecho.pl Hello from the cluster!
    Hello from the cluster!
    [jk2269@compute-7-2 ~]$
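The same protocol can be translated into other languages once the steps are written down. As a hedged illustration, here are the Attempt #4 steps in the shell; the function name `myecho` is made up, and `"$*"` is the shell's way of joining the arguments with spaces.

```shell
# 1. Combine the command-line arguments into a
#    string, separating them by spaces
# 2. Write that string
# 3. Write a newline
myecho() {
    line="$*"                # joins the arguments with single spaces
    printf '%s\n' "$line"    # writes the string followed by a newline
}

myecho Hello from the cluster!
```

The comments are identical to the Perl version; only the statements that implement each step changed.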

That’s fine, but how do you do this, really???

• Why scripting/programming is hard:
  – You must think of everything
    • Use testing, iteration and refinement to make sure that you have thought of everything
    • You can get to “good enough”
  – You have to write everything in a foreign language, with no allowance for error

• My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  – Design what you want the program to do as you would a protocol, in English (or your favorite language)
  – Match program statements to the steps, refining the steps so that they can be translated

Running Jobs on the Cluster

• You must make reservations!
  – Cluster is a shared resource, so you must ask for exclusive use of nodes and cores
  – The job request goes into a queue, and is granted when resources are available
  – How to do this? qsub!

• Interactive jobs
  – “qsub -I” – request 1 core on 1 node
  – “qsub -I -l nodes=1:ppn=8” – request 1 node, with 8 cores

• Batch jobs
  – “qsub myjob.pbs” – request to run the bash script myjob.pbs
    • Louise’s cluster runs PBS/Torque to manage the queues, so the “.pbs” suffix marks this as a script that can be submitted to the cluster

Running Jobs on the Cluster

Example myjob.pbs file

    # Lines containing options for the job request
    #PBS -q general
    #PBS -l nodes=3:ppn=8
    #PBS -o myjob_outFile.txt
    #PBS -e myjob_errFile.txt

    # Just do this
    source ~/.bashrc

    # Set working directory
    cd /data/scratch/firstjob_Jan2015

    # The lines of your script
    echo Hello
    echo Goodbye

Running Jobs on the Cluster

• What if I have to run a program on 100 datasets?
  – You could make 100 scripts, or you could use Simplequeue!
    • Write a text file, where each line is a one-line shell command
    • Use the sqPBS.py program to make a PBS script
    • Submit the PBS script

• Perl program that can write the text file (let’s call it “writeit.pl”)

    use Cwd;
    my $pwd = cwd();
    for my $arg (@ARGV) {
        print "source ~/.bashrc ; cd $pwd ; perl myscript $arg\n";
    }

• Commands to run

    perl writeit.pl dataset*.gz > runit.smplq
    sqPBS.py general 3.2 jk2269 myscript runit.smplq > runit.pbs
    qsub runit.pbs
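The task file can also be written by a short shell function instead of a Perl program. This is a sketch of the same idea, not part of Simplequeue itself: the function name `write_tasks` and the dataset names are made up, while `myscript` follows the slide.

```shell
# Write one one-line shell command per dataset, in the same form as
# the lines produced by writeit.pl.
write_tasks() {
    for dataset in "$@"; do
        # The ~ stays literal inside double quotes; $PWD is filled in now.
        echo "source ~/.bashrc ; cd $PWD ; perl myscript $dataset"
    done
}

# Demonstrate with three placeholder dataset names.
write_tasks dataset1.gz dataset2.gz dataset3.gz
```

In real use the arguments would be `dataset*.gz` and the output would be redirected into the task file, for example `write_tasks dataset*.gz > runit.smplq`.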

What do you need to know how to do to “survive”?

• How to get into the cluster, and back out again.

• How to run commands in the shell.
  – How to type statements into R.

• How to navigate around the directories (and make and remove them).

• How to create, look at and edit text files.

• How to write scripts to do the computations you need to do.

• How to submit jobs, to run things on the compute nodes.

Helpful Tips

• Take a Linux basics tutorial

• The faster you can type, the faster you will be done

• Select and learn a text editor
  – Vi or Emacs

• Select and learn a programming language
  – Perl, Python or R

• Ask these questions to keep you oriented
  – What computer am I on?
  – What directory am I in?
  – Where are the files for my analysis?
  – What program(s) do I have running?
  – What jobs do I have running?

Helpful Tips

• Ask these questions to keep you oriented
  – What computer am I on?
    • Look at the prompt, ‘hostname’
  – What directory am I in?
    • Look at the prompt and window top
    • ‘pwd’, ‘cd’
  – Where are the files for my analysis?
    • ‘ls’
    • ‘mkdir’, ‘rm’, ‘rmdir’
    • ‘more’ or ‘less’, ‘head’, ‘tail’
  – What program(s) do I have running?
    • ‘ps’, ‘top’, ‘screen’
  – What jobs do I have running?
    • ‘qstat’
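The orientation questions above can be answered in one go by a small helper. This is only a sketch: `whereami` is a made-up name rather than a standard command, and because `qstat` exists only where PBS/Torque is installed, the function checks for it first.

```shell
# Answer the orientation questions with the commands from the list above.
whereami() {
    echo "Computer:  $(hostname)"
    echo "Directory: $(pwd)"
    echo "Files here:"
    ls | head -n 5                    # first few files only
    echo "My programs:"
    ps -u "$(id -un)" | head -n 5     # first few of my processes
    # Only ask the queue system if its command is available.
    if command -v qstat >/dev/null 2>&1; then
        echo "My jobs:"
        qstat -u "$(id -un)"
    else
        echo "My jobs: qstat not available on this computer"
    fi
}

whereami
```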

Golden Rule for Bioinformatic Clusters

• Never, ever, ever read and write SAM files. Always pipe the SAM output through samtools to convert it to BAM, if the software doesn’t support native BAM files.
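One way to follow this rule is to wrap the pipe in a helper so that SAM never reaches the disk. This is a sketch under assumptions: `sam_to_bam` is a made-up name, it assumes `samtools` is installed and on your `$PATH`, and the command given after the output name stands for whatever tool writes SAM to standard output.

```shell
# Run a command that writes SAM to standard output and store the
# result as BAM, without ever writing a SAM file.
sam_to_bam() {
    out_bam=$1
    shift
    # 'samtools view -b' converts the SAM arriving on stdin ("-") to BAM,
    # and '-o' names the output file.
    "$@" | samtools view -b -o "$out_bam" -
}
```

A call would look like `sam_to_bam sample1.bam my_aligner reference.fa reads.fastq.gz`, where `my_aligner` and the file names are hypothetical placeholders for your own program and data.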