
Introduction to Linux and the Command-line Interface

Objective

The objective of these class sessions is to introduce participants to the Linux computing environment, with a particular focus on the Unix environment provided as a virtual machine through the NC State Virtual Computing Laboratory. An introductory lecture on key elements of Linux system architecture and computing philosophy will be followed by hands-on computing exercises to provide experience in using command-line utilities to navigate the file system, manage files and directories, and carry out basic file processing tasks. Demonstrations of how these command-line utilities can be applied to sequence analysis tasks are integrated into the exercises.

Description

Introductory slides in the first class session introduce the course objectives and the Linux operating system, and a summary of Chapter 1 from Eric Raymond’s book The Art of Unix Programming (complete text available here) is used as a framework for discussing differences between the Linux command-line interface and graphical interfaces. File globbing and regular expressions provide a basis for discussion of abstraction and generalization as key parts of computational thinking.

Global Overview

Linux is the operating system of choice for computationally-intensive data analysis, because of its design and the efficiency with which it runs. Much open-source software for sequence data analysis is written for Linux, although there is an increasing number of Java-based programs that can run under Windows. A key element of the philosophy behind Unix and Linux operating systems is decomposition of tasks into simple categories – separate command-line utilities are available for separate tasks. Combining these simple individual tools into pipelines provides enormous flexibility for managing and processing data. Abstraction is another key concept in computing - generalizing from a specific case to a larger group of cases that all meet a specific set of criteria. The use of “wildcard” or meta-characters to specify groups of files is a simple example; this process is commonly called file globbing. Regular expressions are another powerful example of abstraction and generalization as key parts of computational thinking.
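
As a small illustration (the file names below are placeholders, not files provided with the course), globbing is expanded by the shell before a command runs, while a regular expression is interpreted by the command itself:

    # File globbing: the shell expands *.fastq to every matching file name
    # in the current directory before ls runs.
    ls -l *.fastq

    # Regular expression: grep itself interprets the pattern; this counts
    # FASTA header lines (lines beginning with ">") in a sequence file.
    grep -c '^>' sequences.fasta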

Key Facts

DNA sequence data and most results of analysis are stored in plain text format, often compressed using an open-source algorithm to reduce the size of the files stored on disk. A few dozen commands for manipulation of text files, executed either separately or in different combinations, provide an enormous range of data manipulation and analysis capabilities. A modest investment of time in learning basic file formats and commands for text file manipulation will pay large returns by enabling you to manage large data files and carry out basic analyses on the command line, without any specialized software for sequence analysis. The file sizes used in bioinformatics analysis are often large, so parallel processing using multiple CPU cores for the same job can be a valuable tool in getting things done efficiently.
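
As a brief sketch of this idea (assuming a hypothetical gzip-compressed FASTQ file named reads.fastq.gz, and that the multi-threaded compressor pigz is installed), standard text utilities can count sequences without any specialized software, and compression can be spread across several CPU cores:

    # Each FASTQ record occupies four lines, so the read count is the
    # line count divided by 4.
    zcat reads.fastq.gz | wc -l | awk '{print $1/4}'

    # Compress a large uncompressed file using 4 CPU cores with pigz;
    # plain gzip would do the same job on a single core.
    pigz -p 4 reads.fastq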

Exercises

  1. An Introduction to Linux is a tutorial that guides participants through an 8-step introduction to the Linux operating system and the Virtual Computing Laboratory (VCL) access used for most class computing exercises. As an initial exercise after connecting to a VCL instance in the first part of the tutorial, we’ll use the commands in fastq-dump.exercise.sh to explore some features of Linux and the virtual machine available through the VCL. Open the Intro to Linux PDF file and follow the directions to connect to a VCL instance, then open a browser in the VCL instance, navigate back to this page, and download the fastq-dump.exercise.sh file to the instance. Follow the directions in that file to compare the relative speed and resource requirements of different methods of achieving the same result (a generic sketch of timing commands appears after this list).

  2. A list of useful Linux commands is available as a handy reference.

  3. The example files for the week1 quiz are at quiz_week1.tgz.

  4. Some links to useful websites with more information about Linux and the bash shell: The BashGuide, An A-Z Index of the Bash Command Line, and LinuxCommand.org.

  5. A quick Quiz to gauge Linux command-line proficiency.
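
The timing comparison mentioned in the first exercise can be done with the bash built-in time and with GNU time; this is a generic sketch using placeholder commands and file names, not the specific commands in fastq-dump.exercise.sh:

    # The bash built-in "time" reports real, user, and system time for a
    # command or pipeline.
    time zcat reads.fastq.gz | wc -l

    # GNU time (usually installed as /usr/bin/time) with the -v option also
    # reports peak memory use and other resource statistics.
    /usr/bin/time -v gzip -c reads.fastq > reads.fastq.gz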

Additional Resources

Background information about Linux:

  • The Software Carpentry website has a series of tutorials introducing many aspects of Linux computing. The lessons entitled The Unix Shell and Programming with R are particularly relevant, because the shell is used throughout this course and R is important in the section on transcriptome analysis.

  • The Harvard Chan Bioinformatics Core offers teaching materials suitable for instructor-led or self-guided learning. Introduction to the command line is the first in a series; other modules cover the R statistical environment, RNA-seq analysis, and ChIP-seq analysis.

  • Data Carpentry has a module Introduction to the Command Line for Genomics that includes exercises. The course is intended to use an Amazon Machine Image cloud-computing environment, but the VCL image available at NC State should be a workable alternative.

  • The LocaleSettingDetails.pdf document covers localization options in UNIX, including the ‘C’ locale, and how locale settings can affect sorting and other operations that depend on character ordering (the LC_ALL=C example in the sketch after this list illustrates one effect).

  • One aspect of command-line use is knowing when a particular command is needed and when it is not. Many command-line utilities, such as grep, cut, wc, sort, sed, and awk, accept filenames as arguments after the command, but will also accept input from stdin via a pipe. Other utilities, such as tr, do not accept a filename as an argument and only process data received from stdin. Some people prefer to use the cat command to put data into a pipeline even when the following command could read the filename as an argument, simply for the sake of consistency and style (see this StackOverflow discussion as an example), while purists argue that using a command when it is not required means running two processes when one will do. This is rarely a problem, but it can lead to differences in the commands needed to accomplish the desired result. One example is the sort utility - if it receives input from stdin via a pipe, it uses default settings that may differ from those it would use if it opened the file directly. This can cause important differences in performance unless specific options are used; see this post for the details. A short sketch of these points follows this list.
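
A short sketch of these points, with placeholder file names:

    # grep accepts a file name directly; the piped version below gives the
    # same result but starts an extra cat process.
    grep 'pattern' input.txt
    cat input.txt | grep 'pattern'

    # tr reads only from stdin, so input must be piped or redirected to it.
    tr 'acgt' 'ACGT' < sequences.txt

    # GNU sort options can make resource use explicit regardless of whether
    # input arrives from a file or a pipe: -S sets the memory buffer size
    # and --parallel sets the number of sorting threads.
    sort -S 1G --parallel=4 input.txt > sorted.txt

    # Setting LC_ALL=C forces simple byte-order sorting, which is faster
    # and independent of the locale settings discussed above.
    LC_ALL=C sort input.txt > sorted_C.txt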

Windows options for access to Linux tools

  • Windows 10 offers an optional beta release of the Windows Subsystem for Linux (WSL), which allows running any of three different Linux-like command-line environments on Windows, although the Linux kernel itself is not installed. These provide a command-line bash shell environment with GNU utilities - see a tutorial on set-up or a Microsoft page. The WSL environment is separate from the Windows environment on the same computer, although it is possible to set up shared file space accessible from both environments.
  • The MobaXterm program is available in both free and paid versions, and provides a fairly complete package of network tools for connecting to remote computers (e.g. ssh, scp, sftp, and X11 graphics, among others) as well as over 200 Linux command-line utilities that can be used to operate on files and directories in your Windows environment. This program is recommended by the NC State High-Performance Computing (HPC) group for Windows users who work on the HPC cluster.
  • Cygwin is a relatively complete set of Linux tools and programs compiled to run on Windows systems, including systems older than Windows 10. If you have an older Windows system, or want an alternative to the Windows Subsystem for Linux, this may be an option to consider. MobaXterm uses Cygwin utilities and includes many of the most commonly-used tools, but is not as comprehensive as a full Cygwin installation.

Setting up an Amazon Web Service account to use Elastic Compute Cloud services:

  • A 2013 guide to setting up an Amazon Web Services account is available for those interested in using cloud-based computing resources, as is a 2013 guide to preparing and running a CloudBioLinux instance on the Amazon Web Services Elastic Compute Cloud (AWS-EC2). The BIT815 course no longer uses AWS resources, so these documents have not been updated to reflect any recent changes in AWS procedures; users are cautioned to follow the instructions on the AWS website rather than those in these documents in case of any conflict.

Class Recordings

Last modified 18 January 2022. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.