_images/NC_State.gif

Discovering and Genotyping Genetic Variation

Global Overview

Different individuals of the same species often have slightly different DNA sequences in nuclear and organellar genomes. Single-nucleotide polymorphisms (SNPs) are differences of single nucleotide bases in DNA, and these may occur at frequencies ranging from less than 10-4 to greater than 10-2, depending on the genetic diversity of the species and the divergence among the samples being compared. Sequence variants may also involve multiple adjacent nucleotides (MNPs), and both SNPs and MNPs may be either substitution, insertion, or deletion events. In addition to variation at the level of the nucleotide sequence, structural variation such as inversions and translocations may also be present. The quality of an available reference sequence is an important contributor to analysis of structural variation in genomic sequencing data, and tools have been developed that are reported to work well with fragmented and incomplete draft assemblies (Wang et al., 2017).

Objective

The objective of this class session is to describe and practice methods for discovering and genotyping DNA sequence variation in samples drawn either from structured mating designs or from natural populations. The most cost-effective approach to this challenge, for species with large genomes (ie greater than a few hundred megabases) is reduced-representation sequencing, originally carried out using Sanger sequencing (Altschuler et al, 2000). For species with genomes of less than a few hundred megabases, including most microbes as well as many model species and some crop plants, low-coverage whole-genome sequencing can be used as a method of genotyping (e.g. Andolfatto et al, 2011; Au et al, 2013; Huang et al, 2010) A variety of methods exploiting high-throughput DNA sequencing platforms for reduced-representation strategies have been described in the past several years.

Description

A common strategy for reduced-representation sequencing is based on digestion of genomic DNA with one or two restriction enzymes, followed by ligation of adapters bearing oligonucleotide “barcode” sequences. Individual samples identified by different barcode sequences on the adapters can be pooled together for subsequent size-selection, amplification, and sequencing. The sequence data from different samples are then analyzed and assigned to the respective samples using the DNA sequences of the barcode section of the adapter. This method was developed in parallel by two different groups, and is known as Restriction-site Associated DNA Sequencing (RAD-seq, Baird et al 2008) by one group and as Genotyping By Sequencing (GBS, Elshire et al, 2011) by the other group. The original protocols for RAD-seq and GBS were different, as were the implementation strategies, but the two procedures have become more similar over time until now the distinction is more historical than actual. Several alternative methods for recovery of a specific subset of genomic fragment were reviewed by Mertes et al (2011); among these are hybridization capture, where synthetic oligonucleotide “bait” sequences are used to selectively enrich a sequencing library for target sequences of interest, and various PCR-based methods for selective amplification of a subset of target sequences. Such multiplex PCR methods have been refined to the point that over 1000 separate fragments can be amplified in a single reaction from a sample of human genomic DNA, and hundreds of independent samples can be processed in parallel (identified by unique adapter indexes) and sequenced in a single library (Hammet et al, 2019). Several commercial laboratories provide genotyping services, using either restriction-enzyme-based reduced representation, multiplex PCR, or hybridization capture methods.

Key Facts

Repetitive DNA is a significant challenge for restriction-enzyme-based reduced-representation sequencing methods, particularly for species without a reference genome sequence. The choice of restriction enzymes used to reduce genome complexity should be based on the desired numbers of loci, the frequency of restriction sites in repetitive DNA sequences (including organelle genomes), and the expected level of DNA sequence variation. In the absence of a reference genome sequence, empirical studies of the fraction of genomic DNA found in specific size classes in products from digestion with different single-digest and double-digest reactions can be used to estimate the yield of fragments in a double-digest RAD-seq or GBS experiment (Peterson et al, 2012). If a draft genome assembly of the target species or a related species is available, simulation can be used to compare the expected yield of restriction fragments from enzyme digestion with different single- or double-digest combinations of enzymes (Herten et al, 2015, Mora-Márquez et al, 2017, Rivera-Colón et al, 2020). Artifacts in identification of SNP and indel variants from short-read sequencing data are common, and different approaches have been taken to develop software that avoids some of these artifacts. Comparisons of different software packages on well-studied real datasets have reported differences among software tools in susceptibility to various types of artifacts (Li, 2014). Verdu et al (2016) reported results of a study to evaluate possible contributions of paralogous or repetitive sequences to a RAD/GBS experiment using the reads2SNPs software of Gayral et al (2013), and suggest strategies that may be useful to others dealing with similar challenges.

The Variant Call Format (VCF) file type is widely used to store information about locations of genetic variants, although it uses coordinates based on some sort of reference sequence as a means of identifying locations of variant positions. A common starting point for identification of genetic variant positions is a set of BAM files with alignments of sequence reads from several different individuals. Several software packages can analyze such a dataset to identify candidate genetic variants and save the results in VCF format, including BCFtools mpileup, Freebayes, and the Genome Analysis Took Kit (GATK). Formal specifications of SAM, BAM, and VCF file formats are maintained on Github at https://github.com/samtools/hts-specs.

Exercises

The RADseq.tgz archive containing the files for class can be downloaded from the link, or with the code below.

wget http://152.7.176.221/ExerciseData/archives/RADseq.tgz
  1. We will complete an exercise in analysis of a reduced-representation sequencing dataset from a linkage mapping population (two heterozygous outbred individuals and 93 F1 progeny) both without the use of a reference genome, and with the use of the reference genome. The RADseq.tgz archive contains an archive of bamfiles for the 95 samples and the reference sequence for LG2 the spotted gar genome for use in the exercise with a reference genome. The complete genome sequence is available either from NCBI, or from the Broad Institute; I used the Broad Institute version. Only LG2 is needed, because the sample dataset includes only loci that map to a single linkage group. The annotation for LG2 is in the same directory; this comes from the GFF3 file for the spotted gar genome (available from Ensembl), for use in identifying SNPs or other variants within or near annotated genes.

  1. Sample datasets are available for download for both the TASSEL GBS pipeline and for the STACKS software package. The TASSEL package is written in Java, and will run on the OpenJDK version available on most Linux distributions, as well as on Windows or Mac computers. TASSEL v5 requires Java 8, which is installed on the VCL machine image as the OpenJDK version, so those interested in using the most recent version of the TASSEL GBS pipeline can download and install that software package. The STACKS software package is written in C++, and can be compiled on Linux or on Mac OSX systems.

  1. After completing the de-novo and reference-based STACKS analyses (output archives for de-novo and reference-based can be downloaded with these links to save time, Direct Download links for SSH users), additional exercises with the R package vcfR and the command-line tool VCFtools are available. A complete script for all commands in the VCFtools tutorial is available, but you will learn more by going through the tutorial step-by-step. Notes from class are also available. These tools are useful in filtering, summarizing, and subsetting selected data from VCF files. Another option is to explore the use of Freebayes, an alternative program for calling SNPs from BAM alignment files for a set of samples. The BAM files used for the samtools mpileup exercise can also be used for a Freebayes run, and the output VCF files compared. To speed up the Freebayes analysis, use the –use-best-n-alleles 4 option to limit the number of possible alleles the program considers at each site. Freebayes uses a Bayesian approach that considers the data from all individuals in a population to identify variant sites in each individual, and will use a list of the 93 BAM files as input for genotyping much as the SAMtools mpileup program does. Type freebayes -h at a terminal prompt for detailed instructions on command-line options for Freebayes; the general form of the command to run Freebayes is
freebayes -L <bamfile.list filename> -f <reference FASTA file> -v progeny.vcf  --use-best-n-alleles 4.

  1. As with SAM and other file formats for genomic data, the VCF format specifies some columns that are mandatory and must contain particular kinds of data, and allows individual software developers considerable freedom to expand on these required fields by adding additional information. In VCF files, the variable fields are the INFO column (which contains summary data about a specific variant across all samples) and the FORMAT string (which specifies data that is available about a variant for each sample with non-missing data at that site) at each genotyped sample, as well as the columns (beginning with column 10) that contain data for each locus from individual samples. One of the vignettes for the vcfR package has a nice overview of the structure of VCF files, although the examples use R and the vcfR package and may not be useful for those unfamiliar with R.

  1. BAM files from alignment of human exome-capture sequencing data from eight samples to the hg19 reference assembly can be downloaded from this link, or using the command
wget http://152.7.176.221/ExerciseData/UConnExercise/bamfiles.tgz

These files were produced by the BWA mem aligner, incorporating read group information during the alignment process, and then processed further using SAMtools to fix mate-pairing problems, sort by coordinate position, mark duplicates, and remove unmapped reads, based on an exercise in variant calling with GATK from University of Connecticut. A VCF file of variants called from all eight samples using the GATK HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs pipeline is also available with the command

wget http://152.7.176.221/ExerciseData/UConnExercise/combined.vcf.gz

Additional Resources

Other software packages for analysis of GBS/RAD-seq data have been reported, including Unified Network - Enabled Analysis Kit (UNEAK, Lu et al 2013), PyRAD (Eaton, 2014), and AftrRAD (Sovic et al, 2015). A key distinction among these is that in the original versions, some (PyRAD and AftrRAD) allow detection of insertion-deletion (indel) variants as well as substitution events, while others (UNEAK, TASSEL, and STACKS) only considered SNP events. Versions of STACKS after v1.38 (dated April 18, 2016) include the ability to do gapped alignments, and should therefore be able to detect indels in addition to SNPs. Similarly, TASSEL has moved completely to a reference-based analysis format that also allows detection of small indels. Note that a posting to the TASSEL Google group on Feb 12, 2015 announced that the UNEAK package for species without a reference genome available is no longer being developed.

_images/UNEAKnotSupported.png

Slides with an overview of GBS - by Keith Merrill

Papers:

  • STACKS 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. Rochette et al, Molecular Ecology 28:4737-4754, 2019. Version 2 of the STACKS package has tools for variant calling either from RADseq data de-novo (in the absence of a reference genome sequence assembly) or from RADseq data aligned to a reference assembly and provided to the program as BAM files. The ‘’’gstacks’’’ program identifies local haplotypes in BAM files (assuming that the target organism is diploid) using a Bayesian model that can accommodate indels as well as substitution variants.

  • Systematic comparison of variant calling pipelines using gold standard personal exome variants. Hwang et al, Scientific Reports 5:17875, 2015. This paper compares the results of variant calling using combinations of three different aligners (BWA-MEM, Bowtie2, or Novoalign) and four different variant callers (GATK HaplotypeCaller, BCFtools mpileup/call, Freebayes, and the Ion Proton Variant Caller) for the NA12878 “genome in a bottle”, using twelve different sets of sequencing data produced by different instruments (Illumina Hiseq2000 and Hiseq2500, and Ion Proton). This allows the authors to draw conclusions about which combination of software tools works best for SNPs vs indels, for each sequencing platform, and also compare the error spectra to identify software-specific error tendencies.

  • Towards population-scale long-read sequencing. De Coster, et al, Nat Rev Genet 22, 572–587 (2021). These authors report that long-read sequencing methods are reaching the point of cost-effectiveness for population studies, particularly for studies of pan-genome variation and structural variation that are difficult with short-read sequencing methods

Class Recordings

Last modified 12 March 2022. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.