
Awk, Sed, and Bash: Command-Line File Editing and Processing

Global Overview

The process of collecting, managing, and analyzing high-throughput sequencing data requires the ability to handle large text files, sometimes 10 GB or more in size. Command-line tools optimized for efficiency and speed are often the best choice for simple tasks such as summarizing data, modifying file formats, and automating repetitive tasks. The four steps of computational thinking apply directly in this context: break a complex problem down into simple steps, look for patterns and similarities among those steps, find a general solution for each class of steps, and put those solutions together into an algorithm or pipeline that accomplishes the desired task. For repetitive tasks, loops allow a script to carry out the same sequence of commands over and over until a desired endpoint is reached.

Objective

Awk, sed, and bash are command-line tools that provide tremendous power for managing and manipulating text files of arbitrary size, from very small to extremely large. The objective of this class session is to give course participants experience in using these tools to carry out specific tasks, and experience in searching on-line resources to find out how to accomplish specific objectives with these tools.

Description

The bash shell is the default command-line user interface in the Lubuntu 18.04 Linux system used for the course. A shell script is simply a text file containing a series of commands recognized by the bash shell; it allows users to create standard workflows and use them over and over with different input files. This is a powerful way to automate routine or repetitive tasks in data management or analysis, and learning some basic skills can make these tasks much easier. Awk is a scripting language particularly well-suited to handling tabular data in text files, such as SAM alignment files or VCF files of DNA sequence variant data. Sed is a “stream editor”, a program that manipulates text files one or two lines at a time as the text passes through a series of piped commands. The capabilities of these three tools overlap, and many tasks can be accomplished with any of them, but each has its own advantages for specific types of problems. Handling multiple files is made easier by file globbing, as described in the FileGlobbing.pdf document, while the RegularExpressions.pdf file provides an overview of regular expressions, a more general and powerful tool for pattern matching in text files. A PowerPoint presentation from a previous year’s lecture is available with this link.
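
As a minimal illustration of how the tools overlap, the same edit can be written for either sed or awk (the file names here are hypothetical):

# sed edits each line as it streams past: rewrite FASTA headers from ">chr..." to ">chromosome_..."
sed 's/^>chr/>chromosome_/' genome.fa > renamed.fa

# the equivalent edit in awk, using a pattern match and the sub() function
awk '/^>chr/ { sub(/^>chr/, ">chromosome_") } { print }' genome.fa > renamed.fa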

Key Facts

Sequence data analysis often requires the ability to examine and modify the contents of text files, and this is exactly the purpose for which awk and sed were designed. Combining these tools with command-line utilities such as cut, sort, uniq, grep, and other shell functions provides powerful capabilities for summarizing or re-formatting data files. The “modulo” operator (%) in awk, for example, is well-suited to working with sequence files in FASTA or FASTQ format, where specific information is found on a particular line within each group of two (for FASTA) or four (for FASTQ) lines. The bioawk version of awk removes the need for this trick by allowing the user to specify the input file format as ‘fastx’, meaning either FASTA or FASTQ; the program then assigns the variables $name, $seq, and $qual to the appropriate parts of each record in the input file. Another specialized version of awk is vawk, which is designed for manipulation of VCF files containing data on the locations of SNPs and other sequence variants, as well as which alleles of those variants are detected in a set of samples. Both of these programs are installed in the VCL machine image, so you can compare them and decide for yourself which you prefer. A sample VCF file is available here for use with bioawk and vawk; the official format specification for the Variant Call Format is available on the GitHub website for VCFtools.
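
A short sketch of the modulo trick and its bioawk equivalent, assuming a hypothetical FASTQ file called reads.fq:

# print only the sequence lines, i.e. line 2 of every 4-line FASTQ record
awk 'NR % 4 == 2' reads.fq

# mean read length, computed from the same sequence lines
awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' reads.fq

# bioawk parses each record directly, so no line arithmetic is needed
bioawk -c fastx '{ total += length($seq); n++ } END { print total / n }' reads.fq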

Exercises

awk_sed_bash.txt has the list of links for the data files needed for the following exercises.

  1. Command-line exercises. Finding and managing files in Linux systems can be a major component of bioinformatics projects, so learning tools to make those tasks easier is important. File globbing and regular expressions are related concepts important for file management, and also useful in writing scripts to speed up analysis. Writing and executing loops is a key skill to learn in programming, because it makes completion of repetitive tasks much easier (a short sketch combining a bash loop with awk follows this list). The bash shell also provides a wide variety of tools to manage system functions, maintain software, and track system resources. Awk allows use of both conditional statements and loops to process and manipulate text files, and can carry out many text-processing activities commonly done with spreadsheet programs in a Windows environment. Awk can do basic arithmetic, but is not intended for statistical analysis - datamash is a better option for basic statistical summaries of large tabular files.

  1. Exercises using sed, bioawk, and bash to modify and summarize data in files.

  1. Handy tips for bash, awk, and sed - these are examples saved from my own applications of these tools. You may find some of them useful, but the lists are by no means complete, so feel free to add information and keep your own list of the most useful tricks for each tool.

  1. An online resource called Linux Command Line Exercises for NGS Data Processing is mostly about awk.

  1. One feature of the bash shell mentioned in the list of handy bash shell tricks is parameter expansion, which offers a range of tools for modifying the values of variables. One example of the utility of these tools is processing a set of FASTQ sequence files - suppose there are samples named S001 to S150, so the sequencing center splits the reads into 150 files named S001.fq.gz to S150.fq.gz. If all these files are saved in a directory, a bash loop can be used to align them to a reference genome, but simply using the input filename as the base for the output alignment file will result in files named S001.fq.gz.bam to S150.fq.gz.bam, in which the “fq.gz” no longer serves a meaningful role. For a variable called $file, parameter expansions such as ${file%%.*} can be used to retrieve specific parts of the filename-and-extension string. The ${file#pattern} and ${file##pattern} constructs remove matched patterns from the left end of the string stored in the $file variable, while the ${file%pattern} and ${file%%pattern} constructs remove matched patterns from the right end of the stored string; the single forms remove the shortest match, and the doubled forms remove the longest match. Examples make this somewhat clearer (a sketch follows this list), but the best way to see how it works is to practice (for example on the files saved in the AtRNAseq archive).

  1. Another useful tool in bash is process substitution, the ability to nest commands inside other commands to combine outputs from different files and commands into a single process. For example, to compare column 2 from one multi-column tabular file to column 3 from a different tabular file and report differences between them:
diff <(cut -f2 file1) <(cut -f3 file2)

  1. Basic walkthrough of AWK from chapter 1 of “The AWK Programming Language”, with datamash examples as well. A scan of the complete text of The AWK Programming Language is also available - this is the authoritative resource for AWK, written by the original authors of the language.

  1. Datamash examples and exercises is a file with example problems for practice with datamash, as well as a comment on how to deal with non-uniform filenames; a one-line datamash sketch follows this list.
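
The loop sketch referenced in exercise 1 above, combining a bash loop, file globbing, and a short awk program (the *.fq.gz files are hypothetical):

# count the reads in every gzipped FASTQ file in the current directory;
# each read occupies 4 lines, so the read count is the line count divided by 4
for file in *.fq.gz
do
    echo -n "$file: "
    zcat "$file" | awk 'END { print NR / 4 }'
done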
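
A sketch of the parameter-expansion exercise above, assuming hypothetical files S001.fq.gz to S150.fq.gz and a hypothetical reference called ref.fa:

# ${file%%.*} strips the longest match of ".*" from the right end of the name,
# so S001.fq.gz becomes S001 and the output can be named S001.bam
for file in *.fq.gz
do
    base=${file%%.*}
    echo "aligning $file to produce ${base}.bam"
    # bwa mem ref.fa "$file" | samtools sort -o "${base}.bam" -   # hypothetical alignment step
done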
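
A one-line datamash sketch of the kind of summary mentioned above, for a hypothetical tab-delimited file counts.txt:

# sort the input (-s), treat row 1 as a header (-H), group rows by column 1,
# and report the mean and sample standard deviation of column 5 for each group
datamash -s -H -g 1 mean 5 sstdev 5 < counts.txt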

Additional Resources

  • The IBM developerWorks site has a three-part series on awk.

  • The blog TheUnixSchool has a page with example awk and sed commands to accomplish specific tasks, as well as a grep search function to find previous postings on any topic of interest (look on the right side of the page, below the “join us on RSS Twitter Facebook Google+” box).

  • The LinuxCommand.org website contains tutorials called Learning the Shell and Writing Shell Scripts that provide a good introduction to shell commands and strategies for writing scripts to combine individual commands into a coherent and efficient workflow. There is also a link to a book called The Linux Command Line which can be downloaded as a PDF.

  • A quick guide to organizing computational biology projects. Noble, PLoS Computational Biology 5(7):e1000424, 2009. This paper offers a suggested organizational plan for keeping track of data from different experiments and projects in a structured set of directories and files. It is aimed at bioinformatics students, so it emphasizes source code and programs more than experimental data or field notes, but the general strategy is applicable to many disciplines.

Class Recordings

Last modified 10 April 2022. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.