Transcriptome Analysis: Differential Gene Expression and Annotation¶

Global Overview¶

Transcriptome analysis is a very broad topic that covers a spectrum from initial characterization of expressed genes in a non-model species with no other genomic information available, to detailed analysis of alternative splicing and gene expression among organs, tissues, or even individual cells of a model organism for which a well-annotated reference genome sequence is known. As previously noted, if the objective of a sequencing experiment is simply discovery of the sequence itself, then experimental design considerations may be less critical, but if the objective is to use the sequencing experiment as a quantitative measure of some biological process (such as gene expression or alternative splicing differences among individuals, developmental stages, or treatments), then an appropriate experimental design is essential.

Objective¶

The objective of this session is to introduce participants to the additional complexity of analyzing transcriptomes by deep sequencing, above the already complex process of analyzing genome structure. RNA transcripts vary both in abundance, in primary sequence, and in the outcome of splicing processes. All these types of variation can have important biological effects, and may be of interest in a “transcriptomics” experiment. Annotation is the process of describing functional elements in genomes, including regions that are transcribed and processed into various kinds of functional RNA molecules. Messenger RNAs that encode proteins are an abundant class of functional RNA molecules, but by no means the only one - long non-coding RNAs, microRNAs, ribosomal RNAs and transfer RNAs are other important classes of functional RNA molecules.

Description¶

RNA-seq experiments are growing in popularity as a means of characterizing the transcriptome, detecting alternative splicing events, and measuring differences in gene expression between samples of different types. The importance of good experimental design in conducting RNA-seq experiments is emphasized in the first paper in the recommended reading, by Auer and Doerge. Any experiment in which differences between samples are to be interpreted in a biological context should take seriously the need for good experimental design. The most reliable conclusions will result from a well-replicated design in which the experimental treatments are orthogonal to nuisance factors. Slides with an overview of RNA-seq workflows and some discussion of experimental design and analysis strategies are available. Engström et al (2013) compared the performance of several different programs designed to align RNA-seq reads to reference genome sequences, and provide a thorough discussion of the advantages and disadvantages of the programs tested. More recently, Conesa et al (2016) surveyed best practices for RNA-seq experiments, and noted that no single workflow is optimal for every experiment.

One exercise in RNA-seq data analysis will follow the description in the EdgeR user’s guide or the DESeq2 vignette on your own or in class. The exercise is based on an experiment reported by Cumbie et al. (2011), and involves comparison of gene expression levels in Arabidopsis plants inoculated with a bacterial pathogen or mock-inoculated with sterile solution. The complete data from the experiment are downloaded from NCBI SRA during the exercise using the At_RNAseq.sh script saved in the AtRNAseq archive; R scripts to run differential gene expression analysis with DESeq2 or edgeR packages are in the same directory.

Key Facts¶

RNA-Seq library construction strategies may be different for different experimental objectives:

Differential gene expression: one sequenced “tag region” per gene is enough to estimate relative levels of gene expression if a reference genome sequence is available to allow mapping tags uniquely to genes, but “tag sequencing” does not give information on alternate splicing.

Gene discovery: comprehensive coverage of transcripts is an asset to obtain complete sequences of expressed genes in species for which a reference genome sequence is not available. Normalization methods that reduce the difference in abundance among transcripts can work well to obtain more complete coverage of all transcripts, but may be a problem if accurate estimates of the relative abundance of transcripts is an experimental objective. Paired-end sequencing is useful for transcriptome assembly, because it provides more information about alternative splicing events and transcript structure for the assembly process.

Alternative splicing analysis: complete coverage of exons is essential, and estimates of relative abundance are important also. Long-read sequencing methods (PacBio and Oxford Nanopore) are becoming the method of choice for analysis of alternative splicing events because they allow analysis of combinations of alternative splicing events across the entire length of the mature transcript. This advantage is most important in studies of genes in which multiple sites of alternative splicing are known, and a central question of interest is which events occur in the same transcript.

Often researchers are interested in all aspects of transcriptome analysis – discovering new transcripts or alternate splicing events of annotated genes, plus measuring relative abundance and detecting genetic variation – so many RNA-Seq experiments use non-normalized libraries of cDNA synthesized by priming with random oligos, to give relatively uniform coverage across entire transcripts. Accurate reconstruction of alternatively-spliced transcripts from individual genes is an important part of RNA-seq data analysis if the experimental objectives include testing for significant differences in levels of alternatively-spliced transcripts among individuals (genetic variation in splicing) or among treatment groups (which may include developmental stages as well as environmental conditions or chemical exposures). Software designed to test for association between genetic variants and levels of alternatively-spliced transcripts is available (Monlong et al, 2014). Pipeline tools that combine multiple software packages into an integrated analytical approach have been described by multiple groups (Cumbie et al, 2011 and Varet et al., 2015 are examples); these may be worth setting up if you have a lot of data to analyze and want the added functionality of an integrated pipeline.

One disadvantage of using an integrated pipeline can be that the details of individual steps in the analysis are obscured, unless the end user can actually read the code and understand what each step of the pipeline does. This can make it difficult to know exactly what analytical routines are used, or how appropriate they are to the user’s dataset. This can be especially important in dealing with sources of bias, such as transcript length and GC content, that can affect the results of differential expression analysis from RNA-seq experiments. Mandelboum et al (PLoS Biol 17:e3000481, 2019) reported that sample-specific variation in length effects resulted in biased results in 35 different datasets downloaded from the GEO database at NCBI, detected as a rank correlation between the log-fold change in expression of a transcript and the length of the transcript, even when comparing replicate samples of the same experimental treatment. Many current protocols for differential gene expression analysis include normalization for transcript length, but testing for a relationship between log-fold change in expression and transcript characteristics is a good control to test for possible bias. Hansen et al (Biostatistics 13:204, 2012) described conditional quantile normalization to reduce bias due to GC content or transcript length, and this method was reported by Mandelboum et al to mitigate the bias they observed in the published datasets they analyzed.

Exercises¶

You can use the commands found in the Transcriptome_data.txt to download the required archives directly from the command line.

The experimental data from Cumbie et al 2011 is saved in the AtRNAseq archive. The files c1.fq.gz, c2.fq.gz, and c3.fq.gz contain sequence reads from three biological replicates of control samples, and files t1.fq.gz, t2.fq.gz and t3.fq.gz contain sequence reads from three biological replicates of test samples. The file Atchromo5.fasta.gz contains the sequence of Arabidopsis chromosome 5, and the file TAIR10.cDNA.fa.gz contains predicted transcripts from the TAIR10 Arabidopsis genome assembly. Matrices produced by Salmon using k-mer-based “quasi-mapping” of read counts or read TPM values are available; these can be imported into DESeq2 or edgeR sessions and analyzed according to the vignettes for those packages (links provided above or with direct download text).

A script to download the complete data from the Sequence Read Archive at NCBI is At_RNAseq.sh, and scripts to analyze the resulting data with kallisto and either edgeR and DESeq2 are also available - these scripts are saved in the AtRNAseq archive from the google team drive, and links are provided here for those who want to try the analysis on other machines. A script to load output from RSEM into R for analysis is also available.

A fairly comprehensive discussion of RNA-seq workflow options (including different approaches to producing tables of read counts from BAM alignment files) is available in a Bioconductor tutorial on gene-level exploratory data analysis; a description of using biomaRt, GO, and KEGG for annotation is given in this tutorial.

Another good overview of RNA-seq analysis is RNAseq Analysis in R, which contains materials (both lecture slides and hands-on computing exercises) for a multi-day workshop. The materials include visualization using heat maps, volcano plots, clustering, and a variety of other methods, using example data from mouse to take advantage of the available annotation to do gene set enrichment analysis.

The “Tuxedo” package of programs (Bowtie2, Tophat, Cufflinks) provide splice-aware read alignment, transcript reconstruction, and estimation of transcript abundance. The latest versions of Bowtie2, Tophat, and Cufflinks are available as compiled executables, and those version can read and write gzipped files. Simply download and unpack the archives for each program, then create a symbolic link between the program and the /usr/local/bin directory

A complete tutorial for analysis of RNA-seq data using Tophat and Cufflinks is available in Trapnell et al (2012); this can be used as a guide to carry out analysis of the control and test datasets used for the RNA-seq exercise described above.

An older tutorial for Gene Ontology (GO) Term Enrichment from RNAseq analysis. The tutorial from a Cold Spring Harbor Plant Biology short corse contains information on the overview of the GO term enrichment and notes on the sampling procedure used to shrink the dataset from the full NCBI record. For more information the Gene Ontology page has links to the annotation tables of various organisms. Additionally, a vignette for the goseq package for GO Term Enrichment using v3.4 of Bioconductor is also available.

An exercise in evaluating the influence of total read depth on the sensitivity and precision of detecting genes using RNA-seq data is available. The download.bamfiles.sh script will download a set of BAM files from Google Drive and run the Hisat2 reference-guided transcript assembler on the files, then compare the GTF files output from the Hisat2 runs with the TAIR10 Arabidopsis reference genome assembly annotation to assess the sensitivity and precision of detecting annotated genes from RNA-seq data. These data are from the study of Marquez et al, 2012 - the BAM files contain only reads aligning to chromosome 2 of the Arabidopsis genome, and were subsampled to represent 20%, 40%, 60%, and 80% of all the available data, as well as the complete set of reads aligned to chr2. A set of questions to answer about the output from the experiment is also available. Important sources of information include the Stringtie manual and the GFFcompare manual webpages.

An exercise on annotation of assembled transcripts (either reference-guided or de-novo assembled) with TransDecoder is available. The annotation.sh script file picks up where Exercise 8 ends, and assumes that the GTF file produced by Stringtie reference-guided assembly of RNA-seq reads is available. Some editing of the script will be necessary to make sure the file names and paths are correct for the files you are using for the annotation exercise. The text file Trinotate_Bioconda_install.txt has information on how to install the complete Trinotate annotation pipeline using Bioconda - this may work on the VCL image, but will be most useful on the HPC because it has more resources available to actually process large datasets.

Additional Resources¶

Statistical design and analysis of RNA sequencing data. Auer & Doerge, Genetics 185(2):405-416, 2010.

Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias. Mandelboum et al, PLoS Biol 17: e3000481, 2019. Transcript length is correlated with log-fold change in expression levels in 35 published RNA-seq datasets after normalization with five different methods, according to these authors. They recommend conditional quantile normalization as a way to reduce the bias due to differences in transcript length and GC content.

BAM alignment files are not the only way to estimate the number of transcripts from each gene detected in an RNA-seq dataset; an alternative approach is to create a k-mer hash table of the transcripts that might be detected, then use that table to analyze the filtered and trimmed reads themselves to estimate the count of reads from each transcript, and therefore the counts for each transcript detected. Software tools to carry out this type of transcript-count estimation include Sailfish, Salmon, Kallisto, and HTSeq.

DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition Audoux et al, Genome Biol 18:243, 2017. These authors developed the DE-kupl software tool for reference-free transcriptome analysis using kmer decomposition of RNA-seq sequencing reads. This can be a powerful tool for analysis of organisms with no previous genomic information, or for detection of novel events in species (such as human) with well-annotated genome and transcriptome data.

Ultrafast functional profiling of RNA-seq data for nonmodel organisms Liu et al, Genome Research 31: 713-720, 2021 These authors developed another reference-free kmer-based analysis tool for RNA-seq data, called Seq2Fun. The Seq2Fun pipeline translates RNA-seq reads into all six reading frames and searches databases of peptide sequences to identify homologous proteins, and produces output including transcript abundance tables, biochemical pathway information, and species of origin. The output from Seq2Fun can be used as input to NetworkAnalyst to carry out Gene Ontology (GO) and KEGG annotation and pathway analysis.

Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies. Li et al, BMC Genomics 21:75, 2020. These authors compare different normalization methods and statistical tests for sensitivity and specificity in analysis of simulated RNA-seq datasets, where the correct answer is known, and report that different methods give optimal results depending on the experimental design.

Interpretation of differential gene expression results of RNA-seq data: review and integration McDermaid et al, Briefings inBioinformatics 20:2044-2054, 2019. The authors describe a new Bioconductor package for visualization of differential gene expression results using R, to simplify production of a variety of graphic outputs from results of three widely-used analysis packages

CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Pertea et al, Genome Biol 19:208, 2018. These authors report the discovery of 224 novel protein-coding genes and 116,156 novel transcripts in the human genome, in additional to millons of transcripts they hypothesize are transcriptional ‘noise’. See also the Research Highlight by W.F. Doolittle in the same issue of the journal, which discusses the definition of gene function as “honed by natural selection in order to contribute to organismal fitness”, and the alternative perspective that suggests that transcribed regions of the genome must have a function because they are transcribed.

Protocol update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0) Mi, et al.Nat Protoc 14, 703–721 (2019). PANTHER is a database of gene orthologues with Gene Ontology classication and assignments to biochemical pathways, searchable by several methods to allow researchers to recover annotation information important for providing biological context to the results of transcriptome analysis experiments.

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Huang et al, Nature Protocols 4: 44-57, 2009

Identification of genetic variants associated with alternative splicing using sQTLseekeR. Monlong et al, Nature Comm 5:4698, 2014

Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Busby et al, Bioinformatics 29:656–657, 2013

Systematic evaluation of spliced alignment programs for RNA-seq data. Engström et al, Nature Methods 10:1185-1191, 2013. This paper reports results of comparisons of several different splice-aware alignment programs, and concludes that none of the programs tested is optimal by all criteria. The STAR alignment program (Dobin et al, 2013; see next reference) ranks highly by most measures, though, and is recommended for use by the Broad Institute as part of their Best Practices pipeline for variant discovery in RNA-Seq experiments.

STAR: ultrafast universal RNA-seq aligner. Dobin et al, Bioinformatics 29:15-21, 2013

A survey of best practices for RNA-seq data analysis. Conesa et al, Genome Biology 17:13, 2016

GENE-counter: a computational pipeline for the analysis of RNA-seq data for gene expression differences. Cumbie et al, PLoS ONE 6(10): e25279, 2011.

Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Fu et al, PNAS 111:1891–1896, 2014

Robust adjustment of sequence tag abundance. Baumann & Doerge, Bioinformatics 30(5):601-605, 2014

Differential analysis of gene regulation at transcript resolution with RNA-seq. Trapnell et al, Nat Biotechnol 31:46-53, 2013

Improving RNA-Seq expression estimates by correcting for fragment bias. Roberts et al, Genome Biol 12:R22, 2011

Class Recordings¶

Session 20: recorded March 10th 2021. A trancscript of the recording is also availabile.
Session 21: recorded March 12th 2021. A trancscript of the recording is also availabile.
Session 22: recorded March 15th 2021. A trancscript of the recording is also availabile.
Session 23: recorded March 17th 2021. A trancscript of the recording is also availabile.
Session 24: recorded March 19th 2021. A trancscript of the recording is also availabile.

Last modified 13 April 2022. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.