_images/NC_State.gif

Re-sequencing, Alignment, Structural Variation

Objective

The objective of this session is to introduce participants to re-sequencing of genomic DNA or cDNA produced from RNA. The term “re-sequencing” indicates that an assembled and annotated genome sequence is available for use as a reference in analysis of the sequence reads. This means that discovery of new sequence information is typically not the primary goal of these experiments. Instead, the sequence reads are used as a measure of some other biological property of interest. A variety of experimental methods have been developed that use massively-parallel DNA sequencing to measure specific aspects of genome structure or transcriptome activity: the accessibility of chromatin to digestion by nucleases, binding of specific proteins to DNA, three-dimensional interactions between chromosomes in the nucleus, and levels of gene expression are some examples of properties that can be measured in this way. After data preprocessing and quality control, alignment of reads to the reference genome sequence provides results in the form of Sequence Alignment and Mapping (SAM/BAM) format alignment files, which are then processed by additional software tools to produce the measurements of experimental interest.

Description

Alignment of short DNA sequence reads to a genome reference sequence can be done by several approaches. One is to produce a “hash table”, either of the sequence reads or of the reference genome sequence. A hash table consists of key - value pairs, where the key is a short DNA sequence (a k-mer) and the value is the number of times that sequence occurs in the reference genome or the sequence reads. This hash table can then be used to compare sequence reads to the genome to identify regions of the genome that most closely match the DNA sequence in each read. An alternative approach relies on a mathematical transformation called a Burrows-Wheeler Transform, which allows construction of an index of the reference genome sequence. This index is then used to identify the genomic location to which each sequence is most similar. We will not delve into the mathematics of these methods; see the Wikipedia page on short read alignment software for links to many different alignment programs.

Key Facts

The length and number of repetitive sequences in a particular genome determine the sequencing strategy to produce data likely to allow a relatively complete assembly, and also help determine the best strategy for detecting variation among individuals in genome sequence. Short-read sequencing is very useful for detecting substitutions or insertion/deletion events (indels) of one to a few basepairs, but is not as sensitive for detecting larger indels (>50 bp), or rearrangments such as inversions, translocations, or duplications. Long-read sequencing has much greater sensitivity for detecting these larger-scale phenomena, but is still a relatively new approach for structural analysis of genomes of higher eukaryotes because of the higher relative cost of long-read methods relative to Illumina sequencing. This cost structure is changing with the increases in the accuracy and yield of reads from PacBio and Oxford Nanopore Technologies instruments. A recent comparison of software designed to detect structural variation in short read resequencing data (Cameron et al, 2019) concluded that programs designed to do local reassembly of reads around regions of putative structural variation give better results than programs based on other algoriths. An in-depth study in Drosophila comparing the number of structural variants caused by transposable elements using short-read resequencing data versus long-read de-novo genome assembly reported that de-novo genome assembly detected hundreds more variant sites in regions that could be recognized as comparable between the reference assembly and their two new assemblies (Ellison and Cao, 2019). While comprehensive analysis aimed at identifying all structural variants among a group of genomes will probably be best done by de-novo genome assembly with long reads and other data types, it may be possible to survey a subset of structural variation more cost-effectively using short read data given the appropriate data types and software tools. Linked-reads, as produced from libraries made using 10x Genomics Chromium method, the BGI stSLR method, Universal Sequencing’s TELL-Seq kit, or the method of Redin et al (2019), can be used to detect structural variation, with advantages relative to standard short read data in sensitivity and relative to long-read sequencing in cost-effectiveness (Elyanow et al, 2018, Zhang et al, 2020)

Variant discovery/genotyping

  • SNPs vs structural variants: SNPs are usually more abundant in terms of numbers of sites, but indels and rearrangement events can affect more nucleotides genome-wide. A report in Nature Genetics (Chiang et al, 2017) suggests that structural variants often have larger effects on relative expression levels of nearby genes than do SNPs.

  • Common alleles vs rare: common alleles are easier to detect and effects can be estimated accurately; rare alleles can be so hard to find that specific strategies are needed to identify sufficient numbers of individuals to provide statistical power to estimate SNP effects

  • In-read barcoding of individual samples, as is common in GBS or RAD-seq approaches, works well for genotyping specific individuals at SNPs with common alleles: sequencing pooled samples without individual IDs works well for identifying rare variants and estimating allele frequencies.

  • Many software packages are available for identifying structural variants in alignments of short-read (usually Illumina) sequence data to a reference genome (Alkan et al, 2011). Layer et al (2014) describe Lumpy, a program that detects structural variation in whole-genome sequencing data by synthesizing information from different kinds of read alignment results - split reads (where a single read aligns to two different locations), discordant read pairs (where the paired-end reads from a single fragment align to locations inconsistent with that expected based on the fragment sizes in the sequencing library), or differences in read depth (due perhaps to variation in copy number of a particular sequence in the sample genome relative to the reference). A more recent publication by Becker et al (2018) describes a Python tool for structural variant detection that uses an ensemble of different structural variant detection programs - BreakDancer (Chen et al, 2009, BreakSeq (Lam et al, 2010, cnMOPS (Klambauer et al, 2012), CNVator (Abyzov et al, 2011), Delly (Rausch et al, 2012), Hydra (Quinlan et al, 2010), and Lumpy. An extension of this approach (Zarate et al, BioRxiv 2018) uses parallelization to run some or all of the SV callers BreakDancer, BreakSeq, CNVnator, Delly, Lumpy, and Manta (Chen et al, 2016), on a multi-core computer in roughly the same time required to run a single program. Two comparison of results from different software tools have recently been published; one reported results from 69 different packages (Kosugi et al, 2019), while another comparison selected a subset of 10 programs based on different algorithms or combinations of algorithms (Cameron et al, 2019). The latter paper reports, perhaps not surprisingly, that one of the better programs for detection of structural variants in Illumina short read data is Gridss, a software package previously described by the same research group (Cameron et al, 2017)

  • Efficient analysis of data from population-level whole-genome sequencing projects in humans is essential, due to the size of the human genome and the computational intensity of sequence alignment and variant detection. Johnston et al (2017) describe two new programs, PEMapper and PECaller, to improve the efficiency of these analyses. Smith et al (2017), describe the Genome Rearrangement Omni Mapper (GROM) program to efficiently identify all classes of sequence variation including SNPs, structural variants, indels, and copy-number variants.

  • Firtina & Alkan (2016) report that changing the order of sequence reads in a FASTQ file, followed by alignment of the reads to the human reference genome and SNP calling with various software tools shows discordance in identified variants, ranging from less than 1% to around 25%. This observation suggests that any variant-calling routine should be tested for sensitivity to read order in the input FASTQ files, although for large datasets this repeatability analysis would be extremely time-consuming.

  • Variant normalization, or reconciling differences in alignments and SNP or indel calls in low-complexity regions, is an important topic for researchers interested in making a complete catalog of genetic variation in a population. A seminal paper by Tan et al (2015) introduced software for variant normalization using a VCF file and the reference sequence. Bayat et al (2017) reported additional concerns regarding the normalization algorithm used by other software and offered another software package they believe does a better job.

Exercises

  1. Data suitable for Gridss analysis, derived from a Drosophila melanogaster resequencing experiment, are available for download: reference genome archive, blacklist file, and reads.bam file (or if you are having issues downloading the bam file use this link) of Illumina reads aligned to the reference genome. A text file with command line code for the downloads is also available. These reads are from Sequence Read Archive accession SRR2033228 - the reason for providing the BAM file rather than the raw read data is to save the considerable time required to align the reads to the reference genome. Notes are also available describing the steps in preparing the BAM file and running Gridss.

  1. Erik Garrison, lead author of the Freebayes SNP caller, has an exercise in alignment and variant calling on his Github page, using both E. coli and human datasets downloaded from web sources. For class please use the ecoli_variants.sh script file. We will work through the script step by step, to trouble-shoot errors as they arise. Most of the software required for this tutorial is installed in the VCL machine image, in the ‘bioinfo’ conda environment; the ‘mutatrix’ program gave compile errors so it is not installed, and the tools listed below sra_toolkit are also not installed. A few comments about the tutorial:
  • The ‘fastq-dump’ command to download data from the Sequence Read Archive still exists, but a faster version (called ‘fasterq-dump’) is now available. You will have to run the ‘vdb-config -i’ command in the bioinfo conda environment to turn off the file cache option before you can run either the fastq-dump or fasterq-dump programs. Remember that the tab key moves the cursor from one item to the next in the vdb-config program, so press Tab until CACHE is highlighted, then hit the Enter key to select that item, and hit Tab until ‘[] enable local file caching’ is highligted. Use the space bar to deselect (the X goes away), then use Tab to move back to the top row until Exit is highlighted, and hit Enter three times to exit and save changes.
  • The modified version of the exercise uses different files of E. coli reads downloaded from SRA to better demonstrate variant calling. Garrison’s tutorial aligns E. coli strain K12 reads to the E. coli K12 reference genome assembly, and (not suprisingly) very few variant loci are discovered. Freebayes runs on a single core, so the process of calling variants on four E. coli isolates takes a long time. An output file from that analysis is saved in ecoli.vcf.gz. An alternative is available - the freebayes-parallel script (installed as part of the Freebayes package) breaks the variant-calling job up into smaller pieces and uses the GNU Parallel utility to run each piece in a separate thread, which speeds things up a lot. There is a minor bug, however, which needs to be corrected in order for this to work on the VCL instance.
  • A set of slides from a presentation Garrison gave in 2015 describing the Freebayes SNP caller and how it is used in the 1000 Genomes project exploring human genome diversity are also available.

  1. Data from an exercise presented at a 2016 Canadian bioinformatics workshop are in Module3.tar.gz.

  • Web pages describe the steps required to align samples of reads from normal and tumor samples to a reference human genome sequence, then analyze the resulting alignments to identify rearrangements. Before executing this exercise, you must create a “virtual environment” in which you install Python v2.7 and the numpy (Numerical Python) module, because the Lumpy program relies on those dependencies. Create the virtual environment with

conda create --name=py2 python=2.7

  • you will have to respond to a question confirming the installation. After creation of the virtual environment is complete, activate the environment using

source activate py2

  • you will see that the prompt changes to include (py2), so you can tell from the terminal prompt which virtual environment is in use. Install the numpy module in the terminal window running the virtual environment, using

conda install numpy

  • this will also ask for confirmation. Normally the creation of a virtualenv and installation of modules would only need to be done once, but because everything in the home directory is lost when a VCL instance is shut down, these steps must be repeated with each new instance that is started.

3.1. Download a shell script that will carry out the commands given in the webpages linked above, edited to reflect the differences in file paths and configuration of the VCL instance using

wget http://152.7.176.221/bit815/ResequencingAlignmentStructuralVariation/module3.sh

IMPORTANT NOTES

The full human reference genome sequence is too large to work in the alignment step, so after you unpack the Module3.tar.gz archive, you will need to change into the Module3 directory and extract the reference sequences for chromosomes 3, 6, 9 and 12 (because those have rearrangements on them) to a new file. The module3.sh script includes the command
bioawk -cfastx '{if($name==3 || $name==6 || $name==9 || $name==12) {print ">"$name"\n"$seq}}' human_g1k_v37.fasta | fold | gzip > chr36912.fa.gz

to do this extraction; other ways are possible as well. The process will take a few minutes, so don’t assume that something is wrong if you don’t get a terminal prompt back right away after entering this command.

After extracting the subset of 4 chromomes from the complete reference genome, the script will delete the files related to the complete human genome reference sequence and index files, to free up disk space. The next step is to create a BWA index before aligning reads to the four chromosomes of interest. The script uses the command

bwa index -p subset chr36912.fa.gz

to create an index with the name ‘subset’. This will take several minutes, so don’t be impatient.

3.2. Map the normal tissue-derived and tumor-derived reads back to the reference genome sequence, piping the SAM-format output from the BWA mem aligner to samtools sort to sort the BAM file by reference position so alignment viewers can efficiently display the resulting alignments. The module3.sh script uses the following command line:

bwa mem -t8 -p subset reads.tumour.fastq | samtools sort -o tumour.bam -

The alignment will take a few minutes for the tumor-derived reads. A modified version of the same command is used to align the normal-tissue-derived reads to the same reference, convert the output to BAM, and sort the output BAM file. After both BAM files are complete, the script uses the samtools index command to produce index files for each of them. If you don’t know how to use the samtools index command (and no one is born knowing this sort of thing), try typing samtools index -h at a terminal prompt to see what information is available, or do a Google search.

3.3. The command to produce files of discordant reads from the BAM alignments uses the “flag” column of SAM format, which is a numerical value that contains answers for 12 different yes-or-no questions. The Explain SAM flags web page has a list of the 12 properties of reads that make up the flag value; if the value 1294 is entered in the box, the corresponding properties of the reads are identified. The samtools view -F1294 option means “do not show reads with flags containing any of these values”, effectively excluding reads with the checked characteristics from the ouput.

3.4. The command to produce files of split reads uses a script called extractSplitReads_BwaMem in the scripts subdirectory of the Module3 directory - make sure you use the correct path when you try to execute this command, and pay attention to the permissions on the files in the scripts subdirectory. How can you change the permissions to allow execution of all those script files?

3.5. The LUMPY program is not installed in the VCL machine image, so those who wish to complete the part of the exercise that requires Lumpy will need to install it from the Github page. The installation instructions conclude with copying all the executable files to the /usr/local/bin directory, so you can execute those programs without concern about what path to use to the program. The paths to the input files, and the names of the input files, however, must match those present on your instance of the machine image.

Additional Resources

  • Information on the Sequence Alignment and Mapping (SAM) format is available at a University of Michigan wiki, at Dave’s Wiki, and in the SAM format specification.

  • Quality control of alignment files is a valuable preliminary step before investing significant time and effort in analysis. A package called indexcov is available to efficiently summarize coverage of different genomic regions within a single sample, or uniformity of coverage across multiple samples, beginning with alignments in BAM or CRAM formats. See Indexcov: fast coverage quality control for whole-genome sequencing. GigaScience 6:1-6, 2017

  • Genomic rearrangements in Arabidopsis considered as quantitative traits. Imprialou et al, Genetics 205:1425-1441, 2017. This paper describes a strategy for mapping likely locations of structural rearrangements in a segregating population of recombinant inbred lines using low-coverage (0.3x) whole-genome resequencing.

  • CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Abyzov et al, Genome Research 21: 974-984, 2011.