Course Resources
Introductory Concepts and Vocabulary
The world of command-line Linux has a specialized vocabulary that is second nature to those who work in that world, but can be a barrier to those without a background in computer science or programming. The first three chapters of Eric Raymond’s book The Art of Unix Programming provide an introduction to the philosophy underlying the Unix and Linux operating systems, a brief history, and some comparisons among operating systems to help new users understand some important differences between Linux and other operating systems.
One key difference between open-source Linux and closed-source commercial operating systems is the modular nature of Linux. A Linux operating system is a collection of individual programs that work together to provide the functionality a user wants, and Linux recognizes that different users want different functionality. A Linux distribution is a package of programs selected to work together to provide core functions, but users are free to add programs with new functions to meet their specific needs. This has to be done in an organized way, to ensure that the added programs are compatible with the core functions of that distribution, so it is common for distributions to maintain repositories of programs that are suitable for installation. For example, the Debian family of Linux distributions (98 derivative distributions as of Jan 2018, per the Debian Wiki webpage) uses a program called apt (Advanced Package Tool) to manage packages. The command apt-cache dump | grep -c "^Package:" yields a count of over 112,000 packages in the repositories for Ubuntu, one of the “children” of the Debian distribution. Many of these are components that work together, while others are redundant or provide overlapping functions, so no single system would need to have all of them installed, but they are available to allow users to configure an Ubuntu-based system to meet their own needs.
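The package count described above can be reproduced with a few lines of shell. This is a minimal sketch, assuming a Debian-family system where apt-cache is installed; the helper name count_package_stanzas is made up for illustration, and the number printed will differ from the figure quoted above depending on the release and on which repositories are enabled.

```shell
#!/bin/sh
# Count the lines on standard input that begin with "Package:".
# count_package_stanzas is an illustrative helper name, not part of apt.
count_package_stanzas() {
    grep -c '^Package:'
}

# Only query the package cache where apt-cache is actually installed.
if command -v apt-cache >/dev/null 2>&1; then
    # grep exits non-zero when the count is 0, so tolerate that case.
    apt-cache dump | count_package_stanzas || true
else
    echo "apt-cache not found; this example needs a Debian-family system" >&2
fi
```

Wrapping the grep call in a function keeps the counting step separate from the apt-specific step, so the same function can be tried on any text that contains Package: lines.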
Many of the programs used for bioinformatics are available in Debian or Ubuntu repositories, which is one reason why this family of distributions is popular with users of such programs. Searching the repository is a good first step when you need to install a particular bioinformatics tool. For example, BioPerl is a set of specialized modules in the Perl scripting language that provide capabilities for DNA and protein sequence analysis and other bioinformatics tasks, and it is often used by other bioinformatics programs that depend upon these specialized capabilities. Executing the command apt-cache search bioperl on the command line of an Ubuntu-based Linux system shows that there are several packages related to BioPerl in the Ubuntu repositories. The most recent versions of software are often not available in the repositories, so if you need a specific recent version of a particular program (perhaps because another program depends on new functions introduced in that version), you may have to install from source rather than from the repository.
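One way to decide between the repository and a source install is to compare the version the repository offers with the version you need. The sketch below is an illustration under stated assumptions: the minimum version 1.7.2 is hypothetical, the helper version_at_least is made up for this example, and GNU sort -V is used only as an approximation of Debian version ordering.

```shell
#!/bin/sh
# version_at_least HAVE NEED: succeed when HAVE >= NEED.
# Relies on GNU "sort -V"; this only approximates Debian version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

package="bioperl"   # package name used in the search command above
needed="1.7.2"      # hypothetical minimum version, for illustration only

if command -v apt-cache >/dev/null 2>&1; then
    # "apt-cache policy" reports the version apt would install as "Candidate:".
    candidate=$(apt-cache policy "$package" 2>/dev/null | awk '/Candidate:/ { print $2 }')
    if [ -z "$candidate" ] || [ "$candidate" = "(none)" ]; then
        echo "$package is not available in the configured repositories"
    elif version_at_least "$candidate" "$needed"; then
        echo "repository version $candidate is new enough"
    else
        echo "repository version $candidate is older than $needed; consider installing from source"
    fi
fi
```

On a Debian-family system, dpkg --compare-versions applies the full Debian version rules (epochs and revisions) and is the safer choice for anything beyond a quick check.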
Cartoons about biology or computing from xkcd
Resources about the Linux command-line environment and the Bash shell
- Greg’s Wiki BashGuide provides a good introduction to the fundamental concepts of the command line interface, including sections on regular expressions, variables and arrays, tests and conditionals, job control, and scripting.
- The Advanced Bash Scripting Guide is a comprehensive resource of information about many aspects of Bash programming, and also includes appendices with introductions to awk and sed, which are powerful tools for managing and manipulating text on the command line.
- Parameter expansion is a wiki page with information about Bash parameter expansion, a handy tool for manipulating variables in the Bash shell.
- A guide to setting up a computing session on a virtual machine through the NC State Virtual Computing Lab (VCL) is available. This resource is only available to members of the NC State community, because the VCL requires authentication with an NC State user id and password. The work done on a virtual machine instance is lost when the instance is terminated, so if you want to save results of analyses done on a virtual machine, you must either upload the files to a cloud storage site (e.g. Google Drive, Dropbox, or something similar) or use NCSU Drive or AFS file space mounted as an external volume on the virtual machine. The AFS file space is mounted by default and a shortcut (called AFS) is placed in the home directory; the NCSU Drive space can be mounted by executing the command mount.mydrive at a terminal prompt and providing your NC State Unity password in response to the prompt. After your password is accepted, your NCSU Drive storage is accessible at /mnt/mydrive.
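As a taste of the parameter expansion mentioned in the list above, the sketch below takes apart a file path using only expansions that are also valid in POSIX sh. The path is illustrative (the data subdirectory and the output name are assumptions), and the file does not need to exist for the expansions to work.

```shell
#!/bin/sh
# Illustrative path; the "data" subdirectory is an assumption for this example.
reads="/mnt/mydrive/data/t3.fq.gz"

filename="${reads##*/}"    # remove the longest leading match of "*/"  -> t3.fq.gz
directory="${reads%/*}"    # remove the shortest trailing match of "/*" -> /mnt/mydrive/data
sample="${filename%%.*}"   # remove the longest trailing match of ".*"  -> t3
uncompressed="${filename%.gz}"   # remove a trailing ".gz"              -> t3.fq
threads="${N_THREADS:-4}"  # use 4 when N_THREADS is unset or empty

# Build a hypothetical output name from the pieces.
output="${directory}/${sample}.trimmed.fq.gz"

printf 'file:   %s\n' "$filename"
printf 'dir:    %s\n' "$directory"
printf 'sample: %s (%s characters)\n' "$sample" "${#sample}"
printf 'plain:  %s\n' "$uncompressed"
printf 'output: %s (threads: %s)\n' "$output" "$threads"
```

A single # or % removes the shortest match and a doubled ## or %% removes the longest, which is why %%.* strips both extensions while %.gz strips only the last one.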
Advice about bioinformatics from blog posts and papers
- Mick Watson weighs in with an opinion about five bad habits bioinformaticians should avoid.
- Ten simple rules for reproducible computational research, from Sandve et al., PLoS Comput Biol 2013.
Links to sequence and alignment data files used in exercises
- Data from reduced-representation sequencing of the genome of spotted gar are available in bamfiles.tgz, a gzipped tar archive containing BAM alignment files for 94 progeny and two parents of a full-sibling family. Only data from linkage group 2 are provided, to keep the file sizes manageable. The reference sequence for the linkage group is available in LG2.fa.gz, and annotation is in LG2.gff3.gz. See Amores et al., 2011 for a complete description of the experimental design and data.
- sampleReadsSAM.tgz: a gzipped tar archive containing two 100-nt paired-end fastq-format sequence files and a SAM alignment file with results of aligning those reads to a small sample of contigs from a reference transcriptome.
- DPC4571.fasta.gz: a gzipped file containing a fasta-format sequence file of the Lactobacillus helveticus strain DPC4571 genome.
- OWB_RAD.fastq.gz: a gzipped file containing fastq-format sequences from an early RAD-seq experiment with the Oregon Wolfe barley lines.
- t3.fq.gz: a gzipped file containing fastq-format sequences from test sample 3 of the Cumbie et al. RNA-seq experiment with Arabidopsis, used for data QC, reference-guided transcriptome assembly, and differential gene expression exercises.
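A quick sanity check after downloading a gzipped fastq file such as those listed above is to count the reads without unpacking the file. The sketch below assumes the common layout of exactly four lines per record (no wrapped sequence lines); the helper name count_fastq_reads and the tiny demo file are made up for illustration.

```shell
#!/bin/sh
# count_fastq_reads FILE.gz: number of records in a gzipped fastq file,
# assuming exactly four lines per record.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Build a two-read demo file in a temporary directory.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip -c > "$tmpdir/demo.fq.gz"

count_fastq_reads "$tmpdir/demo.fq.gz"   # prints 2

rm -rf "$tmpdir"
```

A result that is not a whole number suggests a truncated download or a file that does not follow the four-line layout. For the tar archives, tar -tzf followed by the archive name lists the contents without extracting them.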
Links to useful Wikipedia pages
- FASTQ format definition and explanation
- FASTA format definition and explanation
- K-mer definition and description on Wikipedia
- Sequence alignment software page, with a fairly comprehensive list of open-source and commercial programs for analysis of DNA sequence data.
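To make the k-mer idea from the list above concrete, the sketch below counts every overlapping k-mer in sequences supplied one per line. It is a toy illustration, not a substitute for a dedicated k-mer counter; the function name count_kmers is made up for this example.

```shell
#!/bin/sh
# count_kmers K: read one sequence per line on standard input and print
# each distinct k-mer with the number of times it occurs, sorted by k-mer.
count_kmers() {
    awk -v k="$1" '
        {
            s = toupper($0)
            # Slide a window of width k along the sequence.
            for (i = 1; i + k - 1 <= length(s); i++)
                counts[substr(s, i, k)]++
        }
        END { for (kmer in counts) print kmer, counts[kmer] }
    ' | sort
}

echo "ACGTACGT" | count_kmers 3
# prints:
# ACG 2
# CGT 2
# GTA 1
# TAC 1
```

A sequence of length L contains L - k + 1 overlapping k-mers, so the eight-base example yields six 3-mers, two of which occur twice.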
Links to other sequence data analysis course materials
- ANGUS 5.0, the Michigan State University course on Analyzing Next-Generation Sequencing data.
- Analyzing Next-Gen Sequencing Data 2013: slides, homework, and notes from a course taught by Istvan Albert at Penn State.
- Unix & Perl Primer for Biologists, an on-line course by Keith Bradnam and Ian Korf at UC Davis, for biologists interested in learning Unix and the scripting language Perl.
- Course on UNIX and Genomic Data, Prague, Jan 2016 - complete materials, provided by Libor Morkovsky and Vaclav Janousek. The materials include exercises, both in the “Additional exercises” link and in PDFs of course slides under the “Slide decks” link. The slide sets under the “old” heading are from a course taught in April of 2015, and are not out of date.
Links to software pages on Github and Sourceforge
- SAMtools and BCFtools versions 1.1 and higher, on Github: tools for processing SAM/BAM alignment files and VCF/BCF variant call files.
- SAMtools and BCFtools version 0.1.19 and earlier, on Sourceforge: earlier versions of tools for processing SAM/BAM alignment files and VCF/BCF variant call files.
- Flexbar on Sourceforge: a tool for barcode-splitting of single-end or paired-end reads, quality filtering and trimming, and adapter removal, with links to download source code and to the manual with complete documentation.
- bioawk on Github: a version of the awk text-processing utility with specific features added to speed processing of biological data files, including BED, GFF, SAM, VCF, and FASTA/FASTQ files. This includes the ability to read and write gzipped files, which standard awk cannot do.
- Musket on Sourceforge: a multi-stage, k-mer spectrum based error correction program capable of multi-threaded error correction of Illumina short reads.
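Because standard awk cannot open gzipped files directly, as noted in the bioawk entry above, the usual workaround is to decompress to a pipe. The sketch below reports the length of each sequence in a gzipped fasta file this way; the helper name fasta_lengths and the demo records are made up for illustration.

```shell
#!/bin/sh
# fasta_lengths FILE.gz: print "name length" for each record in a gzipped
# fasta file, using gzip -dc to feed standard awk through a pipe.
fasta_lengths() {
    gzip -dc "$1" | awk '
        /^>/ {
            if (name != "") print name, len   # finish the previous record
            name = substr($1, 2)              # header word without ">"
            len = 0
            next
        }
        { len += length($0) }                 # sequence may span several lines
        END { if (name != "") print name, len }
    '
}

# Build a two-record demo file in a temporary directory.
tmpdir=$(mktemp -d)
printf '>chr1 demo\nACGT\nAC\n>chr2\nGGG\n' | gzip -c > "$tmpdir/demo.fa.gz"

fasta_lengths "$tmpdir/demo.fa.gz"
# prints:
# chr1 6
# chr2 3

rm -rf "$tmpdir"
```

With bioawk installed, the equivalent is believed to be bioawk -c fastx '{ print $name, length($seq) }' applied directly to the gzipped file; check the bioawk documentation before relying on that form.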
Last modified 2 January 2020. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.