
Course Resources

Introductory Concepts and Vocabulary

The world of command-line Linux has a specialized vocabulary that is second nature to those who work in that world, but can be a barrier to those without a background in computer science or programming. The first three chapters of Eric Raymond’s book The Art of Unix Programming provide an introduction to the philosophy underlying the Unix and Linux operating systems, a brief history, and some comparisons among operating systems to help new users understand some important differences between Linux and other operating systems.

One key difference between open-source Linux and closed-source commercial operating systems is the modular nature of Linux. A Linux operating system is a collection of individual programs that work together to provide the desired functionality for a user, and Linux recognizes that different users desire different functionalities. A Linux distribution is a package of programs selected to work together to provide core functions, but users are free to add programs with new functions to meet their specific needs. This has to be done in an organized way, to ensure that the added programs are compatible with the core functions of that distribution, so it is common for distributions to maintain repositories of programs that are suitable for installation. For example, the Debian family of Linux distributions (98 derivative distributions as of Jan 2018, per the Debian Wiki webpage) uses a program called apt (Advanced Package Tool) to manage packages. The command apt-cache dump | grep -c "^Package:" yields a count of over 112,000 packages in the repositories for Ubuntu, one of the "children" of the Debian distribution. Many of these are components that work together, while others are redundant or provide overlapping functions, so no single system would need to have all of them installed, but they are available to allow users to configure an Ubuntu-based system to meet their own needs.
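The package count described above can be reproduced with a short script. This is a minimal sketch, assuming a Debian or Ubuntu system; the helper name count_packages is introduced here only for illustration, and the number printed will depend on the release and on which repositories are enabled.

```shell
#!/usr/bin/env bash

# Count the lines on standard input that begin with "Package:".
# (count_packages is a hypothetical helper name, not part of apt.)
count_packages() {
    grep -c '^Package:'
}

# Query the package cache only if apt-cache is installed.
if command -v apt-cache >/dev/null 2>&1; then
    # grep exits non-zero when the count is 0, so "|| true" keeps the
    # script from reporting a failure in that case.
    apt-cache dump | count_packages || true
fi
```

Note that the pattern is wrapped in plain ASCII quotes; typographic ("curly") quotes pasted from a web page are not treated as quoting characters by the shell.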

Many of the programs used for bioinformatics are available in Debian or Ubuntu repositories, which is one reason why this family of distributions is popular with users of such programs. Searching the repository is a good first step when you need to install a particular bioinformatics tool. For example, BioPerl is a set of specialized modules in the Perl scripting language that provide capabilities for DNA and protein sequence analysis and other bioinformatics tasks, and it is often used by other bioinformatics programs that depend upon these specialized capabilities. Executing the command apt-cache search bioperl on the command line of an Ubuntu-based Linux system shows that there are several packages related to BioPerl in the Ubuntu repositories. The most recent versions of software are often not available in the repositories, so if you need a specific recent version of a particular program (perhaps because another program depends on new functions introduced in that version), you may have to install from source rather than from the repository.
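Before deciding whether to install from source, it can help to check which version the repository would actually provide. The sketch below assumes a Debian or Ubuntu system and uses apt-cache search together with apt-cache policy; the helper name candidate_version and the choice of bioperl as the package to query are illustrative assumptions.

```shell
#!/usr/bin/env bash

# Print the version listed on the "Candidate:" line of
# `apt-cache policy` output read from standard input.
# (candidate_version is a hypothetical helper name.)
candidate_version() {
    awk '$1 == "Candidate:" { print $2 }'
}

pkg="bioperl"   # package name assumed for illustration

if command -v apt-cache >/dev/null 2>&1; then
    # List packages whose name or description matches the search term.
    apt-cache search "$pkg"
    # Show the version that would be installed from the repository.
    apt-cache policy "$pkg" | candidate_version
fi
```

If the version printed is older than the one you need, that is a sign you may have to build the program from source.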

Cartoons about biology or computing from xkcd

Resources about the Linux command-line environment and the Bash shell

  • Greg’s Wiki BashGuide provides a good introduction to the fundamental concepts of the command line interface, including sections on regular expressions, variables and arrays, tests and conditionals, job control, and scripting.
  • The Advanced Bash Scripting Guide is a comprehensive resource covering many aspects of Bash programming. It also includes appendices with introductions to awk and sed, which are powerful tools for managing and manipulating text on the command line.
  • Parameter expansion is a wiki page with information about Bash parameter expansion, a handy tool for manipulating variables in the Bash shell.
  • A guide to setting up a computing session on a virtual machine through the NC State Virtual Computing Lab (VCL) is available. This resource is only available to members of the NC State community, because the VCL requires authentication with an NC State user ID and password. Work done on a virtual machine instance is lost when the instance is terminated, so if you want to save the results of analyses done on a virtual machine, you must either upload the files to a cloud storage site (e.g., Google Drive, Dropbox, or something similar) or use NCSU Drive or AFS file space mounted as an external volume on the virtual machine. The AFS file space is mounted by default and a shortcut (called AFS) is placed in the home directory; the NCSU Drive space can be mounted by executing the command mount.mydrive at a terminal prompt and providing your NC State Unity password in response to the prompt. After your password is accepted, your NCSU Drive storage is accessible at /mnt/mydrive.
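As a small taste of the parameter expansion covered in the resources listed above, the following sketch pulls apart a file path using only Bash built-in expansions. The path and variable names are hypothetical, chosen to resemble a typical sequence data file.

```shell
#!/usr/bin/env bash

# Hypothetical file path, used only for illustration.
reads="/data/run1/sample_A.fastq.gz"

file="${reads##*/}"        # remove the longest leading match of "*/"
dir="${reads%/*}"          # remove the shortest trailing match of "/*"
sample="${file%%.*}"       # remove the longest trailing match of ".*"
short="${file/fastq/fq}"   # replace the first "fastq" with "fq"
threads="${THREADS:-4}"    # use 4 if THREADS is unset or empty

echo "$file"     # sample_A.fastq.gz
echo "$dir"      # /data/run1
echo "$sample"   # sample_A
echo "$short"    # sample_A.fq.gz
echo "$threads"
```

Expansions like these avoid calling external programs such as basename or sed for simple string edits, which keeps scripts short and fast.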

Advice about bioinformatics from blog posts and papers