HPC and LSF

Objective

The objective of this section is to work through NCSU’s High-Performance Computing (HPC) quick-start tutorial to gain familiarity with the NC State HPC cluster, and to use other reference videos to review and practice cluster job submission using LSF, the job management system of the HPC.

Singularity containers are an important asset for working on the HPC, because they preserve exactly the same working environment for the duration of a project, and can be made available to other scientists in pursuit of the goal of reproducible research. Unlike Docker containers, which typically require root privileges to run, Singularity containers run with the same permissions as the user who launches the container, but can still be deployed on computing clusters to take advantage of the hardware resources available there.

Access to the NC State HPC is available to faculty, staff and students at NC State University - see Request Access for more information. Mac and Linux users can connect directly to the HPC by SSH to login.hpc.ncsu.edu from a terminal window; Windows users should install MobaXterm, which provides a terminal environment for SSH as well as a number of other useful tools.

The HPC uses a tool called Load Sharing Facility (LSF) to manage the queues of computing jobs submitted for processing - see the LSF Document for more information about LSF scripts.

Exercises

Using NC State University’s High-Performance Computing (HPC) Service:

  1. Go to the quick-start guide page on the HPC website.
  2. Either watch and follow along with the video guide or work through the text version at the bottom of the same page.
  3. A Zoom video recording of the class session working through the Quick-Start tutorial is available at this link (or by video download & transcript download).
  4. A text file is available with an overview of commands used to set up a Conda environment on the HPC, and an example LSF job script used to carry out a series of commands using software installed through Conda. A Zoom video recording of the class session working through Conda setup and LSF job submission is available at this link (or by video download & transcript download).
  5. A text file is available with an outline of steps required to set up a Conda environment with Trinity and TransDecoder, do de novo transcriptome assembly, and carry out functional annotation. Sorting out the details of how to do these steps is left as an exercise. A Zoom video recording of the class session working through these steps is available at this link (or by video download & transcript download).
  6. Notes from 10 April 2020 - an exercise to carry out assembly and annotation of a yeast RNA-seq dataset using Trinity, TransDecoder, and Trinotate installed in a Conda environment.
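
As a rough illustration of the kind of LSF job script mentioned in item 4, here is a minimal Python sketch that renders one as text. It is not the script from the class text file. The job name, core count, time limit, environment path, and commands are all hypothetical placeholders. The `#BSUB` options shown (`-J`, `-n`, `-W`, `-R`, `-o`, `-e`) are standard LSF directives, but the values appropriate for the NC State HPC should be taken from the LSF guide.

```python
def render_lsf_script(job_name, cores, wall_minutes, commands):
    """Return the text of an LSF batch script as a single string."""
    header = [
        "#!/bin/bash",
        f"#BSUB -J {job_name}",         # job name shown in the queue
        f"#BSUB -n {cores}",            # number of cores requested
        f"#BSUB -W {wall_minutes}",     # wall-clock limit, in minutes
        '#BSUB -R "span[hosts=1]"',     # keep all cores on a single node
        f"#BSUB -o {job_name}.%J.out",  # stdout file; LSF replaces %J with the job ID
        f"#BSUB -e {job_name}.%J.err",  # stderr file
    ]
    # The commands run in order after the scheduler directives.
    return "\n".join(header + list(commands)) + "\n"


# Hypothetical example: the environment path and the command are placeholders.
script = render_lsf_script(
    job_name="trinity_check",
    cores=4,
    wall_minutes=60,
    commands=["conda activate /path/to/env", "Trinity --version"],
)
print(script)
```

Saving the printed text to a file such as `trinity_check.sh` (a hypothetical name) and running `bsub < trinity_check.sh` on a login node would submit it to LSF.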

Final Project

The Final Project is optional, but completing it is highly recommended. The goal of this project (should you choose to accept it) is to use the HPC and LSF scripting to:

  1. Do a de novo transcriptome assembly with yeast RNA-seq data.
  2. Identify open reading frames, and annotate those ORFs using results of protein similarity searches, protein domain analysis, and various other tools.

The software to be used includes the Trinity assembler, the TransDecoder suite for identification of ORFs, and the Trinotate suite for annotation. You will need to install the necessary tools in a Conda environment on the HPC. The text file linked in the HPC and LSF section (under section 6 of Exercises) has some notes on how to set up and run the Conda environment to carry out the Trinity assembly as an example. The diamond (protein similarity search) and Pfam (protein domain search) databases are available at /share/bit815s20/databases. The RNA-seq data are in /share/bit815s20/yeast/RNAdata - look for the six files formed by combining each sample prefix (rnaC1, rnaC2, rnaC3) with each suffix (_1.fastq.gz, _2.fastq.gz), for example rnaC1_1.fastq.gz. The TransDecoder and Trinotate pipelines will be similar, but (of course) using commands specific to those software packages.
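
To make the file-naming pattern concrete, the following Python sketch lists the six RNA-seq file paths and assembles a Trinity command line from them. It is only a starting point under stated assumptions: the `_1` and `_2` files are assumed to be the left and right reads of paired-end data, and the CPU count, memory limit, and output directory name are hypothetical values to be adjusted. The Trinity options used (`--seqType`, `--left`, `--right`, `--CPU`, `--max_memory`, `--output`) are documented on the Trinity website.

```python
import shlex
from pathlib import PurePosixPath

DATA_DIR = PurePosixPath("/share/bit815s20/yeast/RNAdata")
SAMPLES = ["rnaC1", "rnaC2", "rnaC3"]


def paired_reads(data_dir=DATA_DIR, samples=SAMPLES):
    """Return (left, right) lists of FASTQ paths, one entry per sample."""
    # Assumption: *_1.fastq.gz holds left reads and *_2.fastq.gz right reads.
    left = [str(data_dir / f"{s}_1.fastq.gz") for s in samples]
    right = [str(data_dir / f"{s}_2.fastq.gz") for s in samples]
    return left, right


def trinity_command(left, right, cpus, max_memory, out_dir):
    """Build a Trinity command line as one shell-safe string."""
    args = [
        "Trinity",
        "--seqType", "fq",             # input reads are in FASTQ format
        "--left", ",".join(left),      # comma-separated left-read files
        "--right", ",".join(right),    # comma-separated right-read files
        "--CPU", str(cpus),
        "--max_memory", max_memory,
        "--output", out_dir,
    ]
    return shlex.join(args)


left, right = paired_reads()
# Hypothetical resource values; match them to the cores requested from LSF.
cmd = trinity_command(left, right, cpus=8, max_memory="20G", out_dir="trinity_out")
print(cmd)
```

The resulting string is meant to be placed in an LSF job script rather than run on a login node, with `--CPU` matching the number of cores requested from LSF.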

More detailed information on how to structure the commands for Trinity, TransDecoder, and Trinotate is available at the respective websites for those software packages. The ability to find the information you need to understand how to use software is an important skill to practice, as new software and new methods emerge all the time. However, we are available to answer questions, either via the Slack channel or during class meetings.
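
As one example of what the TransDecoder documentation describes, ORF identification is normally run in two steps on the assembled transcripts: `TransDecoder.LongOrfs` followed by `TransDecoder.Predict`. The sketch below only builds those two command lines as text. The transcript file name `Trinity.fasta` is a placeholder, because the actual name and location depend on the Trinity version and output directory, and any options beyond `-t` are left out and should be looked up on the TransDecoder website.

```python
import shlex


def transdecoder_commands(transcripts_fasta):
    """Return the two basic TransDecoder command lines, in run order."""
    # Step 1: extract long open reading frames from the transcripts.
    long_orfs = shlex.join(["TransDecoder.LongOrfs", "-t", transcripts_fasta])
    # Step 2: predict the likely coding regions among those ORFs.
    predict = shlex.join(["TransDecoder.Predict", "-t", transcripts_fasta])
    return [long_orfs, predict]


# "Trinity.fasta" is a placeholder for the assembly produced by Trinity.
for line in transdecoder_commands("Trinity.fasta"):
    print(line)
```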

Resources

NCSU OIT HPC main page

NCSU HPC LSF guide

NCSU HPC conda guide - Conda is the preferred method for installing software on the HPC, because it provides a way to manage dependencies and version requirements. Conda was originally written primarily for managing Python virtual environments, but has been extended to include a variety of software not related to Python. See the Bioconda site for more information about the enormous variety (> 7000 different bioinformatics programs) available for installation through the bioconda channel of Conda.

NCSU Bioinformatic Users Group (deBUG) Read the Docs site.

Singularity v3.5 documentation from Sylabs.io

Using Singularity on the NIH HPC is documentation for the NIH cluster, but has lots of useful advice and links to more information.

Using Singularity on the NC State BRC cluster is specific to the BRC cluster, which uses SLURM rather than LSF. This also has good advice and links.

Class Recordings

Last modified 4 January 2022. Edits by Ross Whetten and Will Kohlway.