Transcriptome Assembly¶

Global Overview¶

Multiple options are available for reconstruction of the sequences of RNA transcripts based on analysis of complementary DNA copies, depending on whether a high-quality reference genome assembly is available. If so, reference-guided transcriptome assembly using a splice-aware sequencer aligner followed by resolution of alternative splicing variants and export of consensus transcript sequences can be powerful (Trapnell et al., 2012). If there is no reference genome assembly, or the available assembly is fragmented and poorly-annotated, a de-novo assembly of putative transcripts from short-read sequences is one alternative. The availability of long-read sequencing methods such as Oxford Nanopore Technologies and Pacific Biosciences has opened a third alternative, which is to obtain full-length sequences of cDNAs and report those sequences without the requirement for assembly at all (Sharon et al., 2013; Bolisetty et al., 2015; Minoche et al., 2015). The key disadvantage to long-read sequencing is that it often fails to sample deeply enough to capture full-length copies of rare transcripts, so transcriptomes based solely on long-read sequencing may be lacking a full complement of genes expressed at low levels or in a limited number of cell types. A number of publications have reported that more complete and accurate transcriptome assemblies can be produced from a combination of read types and assemblers than from any single assembler or read type (Gilbert 2013; Nakasugi et al 2014; McManes 2018; Venturini et al 2018; Gilbert 2019). The variation in length and transcript abundance of RNAs, combined with the additional complexity contributed by multi-gene families and alternative splicing events, means that no single assembly algorithm or set of parameters can be optimal for every transcript. In all cases described so far, merging together of assemblies produced by different algorithms, or the same algorithm but different parameters, has produced better results than any single assembly from an individual assembler run with a single set of parameters.

Objective¶

The objective of this session is to introduce participants to methods for assembly of a transcriptome, or a collection of RNA transcripts present in a given RNA sample. The RNA-seq data used in the differential gene analysis exercise will be used for this exercise, along with other datasets that are small enough to run on systems with only 32 Gb RAM.

Description¶

De-novo assembly of short DNA sequence reads into contiguous full-length copies of RNA transcripts is a complex process, because there are multiple sources of variation in read coverage and in read sequences. Read coverage is proportional to transcript abundance, but can also be affected by biases due to library construction methods and biological factors. A key factor leading to variation in transcript sequences, and therefore variation in read sequences, is alternative splicing of transcripts from the same transcription unit into different mature forms. Another factor affecting read sequences is the increased error rate of RNA polymerases relative to DNA polymerases - RNA sequences are inherently noisier than DNA sequences. Trinity (Grabherr et al, 2011) is a widely-used suite of programs for de-novo assembly of RNA-seq data. One key difference among assembly programs is the amount of memory required - the recommended amount of RAM for Trinity is 1 Gb RAM per million reads in the dataset, while other assemblers can use much less, although different datasets show different requirements in terms of both memory and run time (Hölzer & Marz 2019).

Construction of a de Bruijn graph is one approach used to solve the computational problem of assembling short DNA sequence reads into accurate contigs that reflect the transcripts present in the original RNA sample used for library construction. A key difference among assemblers that affects the amount of memory required is how the de Bruijn graph is built and stored, and whether the entire de Bruijn graph of all k-mers present in the full set of reads is held in memory for the entire process of assembly. Slides from Lilian Matallana’s lecture on de Bruijn graphs and their use in sequence assembly programs describe these different strategies in more detail.

Key Facts¶

Paired-end sequencing reads are useful for assembly of eukaryotic transcriptomes, because the information they provide about the positions of sequences relative to each other within a transcript is valuable for correct assembly of alternatively-spliced transcripts. In bacteria, splicing is less frequent, alternative splicing can be less of a concern, and single-end reads can often be assembled into a reasonably accurate transcriptome. If alternative splicing is not of interest, then single-end reads are useful for RNA-seq analysis of eukaryotic transcriptomes as well - now that read-lengths of 100 to 150 nt are readily available, paired-end reads add little additional information regarding levels of gene expression.

An important step in the process of transcriptome assembly is evaluation of the quality of the assembly produced. The EvidentialGene pipeline (often shortened to Evigene) produces summary statistics describing the proportion of contigs in the final assembly that show complete or partial similarity to known proteins in the databases used for comparison, but most assemblers don’t provide this information. The authors of the Trinity assembler also provide a package of programs called Trinotate, which is designed to identify open reading frames in contigs, translate these to predicted polypeptide sequences, and characterize those polypeptides by comparison with public databases to produce an output file of annotation, but evaluation of transcriptome assembly quality is not a primary objective of this package. The program rnaQUAST is designed to compare one or more de-novo transcriptome assemblies to a reference assembly and annotation file as a means of comparing the de-novo assemblies. In the absence of an annotated reference genome assembly, the rnaQUAST program can be integrated with the output of GeneMarkS-T and BUSCO to evaluate the completeness of a de-novo transcriptome assembly. The BUSCO package tests for the presence of a limited set of putative “conserved single-copy orthologues”, genes expected to be present in all organisms from a specific taxonomic group, and provides databases of such genes for a variety of taxonomic groups.

Exercise - reference-guided assembly¶

Direct download links for class can be found in the Transcriptome_Assembly.txt file.

The Arabidopsis thaliana RNA-seq dataset used for the analysis of differential gene expression includes six sequence files found in the fullset.zip archive in the AtRNAseq archive.

The HiSat2 aligner is already installed on the VCL system. If you want to install it on another computer, an executable binary can be downloaded from the link under the Releases heading on the right side of the program home page. This program is the successor to Bowtie and Tophat, the original programs developed for reference-guided sequence assembly.

Create a folder for this exercise, and unpack the Atchromo5.fasta.gz reference sequenceinto that directory. Use the hisat2-build program (from the hisat2-2.0.5 directory) to build an alignment index from the uncompressed fasta-format sequence data.

Unpack the three control-sample RNA-seq read files (named c1, c2 and c3) from the fullset.zip archive to a single fastq file, using the gzip -cd command with file globs.

Align the fastq-format sequence reads to the reference sequence using the hisat2 program (using the –dta option), and pipe the output to samtools view to convert the SAM output into BAM, then to samtools sort to sort the output BAM data and save it to a file.

Stringtie is already installed on VCL instances. If you are not working from the VCL you can download and install the Linux x86_64 binary version of the StringTie program from the link under the Obtaining and installing StringTie heading on the program home page.

Execute the stringtie program using the sorted BAM file as input.

Notes from class for reference guided assembly.

Exercise - de-novo assembly¶

The Arabidopsis thaliana RNA-seq dataset can also be used for de-novo assembly, although it is comprised of short, single-end reads so it is not ideal. The 32 Gb of RAM available on instances of the VCL machine image is a limiting factor. Disk storage space is another important limiting factor on the VCL image; it has only about 7 Gb of available space, which is not enough to store the input RNA-seq reads and have room for the assembler to write temporary files and output. The BITfileserver contains three sets of paired-end reads of Drosophila melanogaster RNA-seq data from different developmental stages; these can be used for a trial assembly using the Trans-ABySS transcriptome assembler (available in the ‘bioinfo’ conda environment). Download the data using a bash loop such as:
```
for N in adult embryo larva;
do wget http://152.7.176.221/bit815/TranscriptomeAssembly/${N}_1.fq.gz;
wget http://152.7.176.221/bit815/TranscriptomeAssembly/${N}_2.fq.gz;
done
```

The code to quality-trim and adapter-clip the read files and run the Trans-ABySS assembler is relatively brief, but reading the software manuals to understand the function of the command-line options is important. Example commands are provided to quality-filter and adapter-trim three files of Drosophila RNA seq-data, using a bash loop to process each of the three input RNA-seq datasets in series. After filtering is complete, the original files can be deleted to free disk space, followed by a one-line command to run the Trans-ABySS assembler:
```
source load_conda
conda activate bioinfo

for file in adult embryo larva;
do bbduk.sh in=${file}_1.fq.gz in2=${file}_2.fq.gz out=${file}.fq.gz ref=adapters mink=11 ktrim=r qtrim=rl minlength=50 maxns=0 trimpolya=15;
done
rm adult_1.fq.gz adult_2.fq.gz embryo_1.fq.gz embryo_2.fq.gz larva_1.fq.gz larva_2.fq.gz
transabyss --pe adult.fq.gz embryo.fq.gz larva.fq.gz --outdir ~/out --name flyRNA --length 200 --threads 15 -k 35
```

This could take three to four hours to complete, so be sure to request enough time for your VCL instance to finish the assembly. If you want to keep the final assembly (in the file out/flyRNA-final.fa), transfer it from the VCL instance to some permanent storage location (e.g. Google Drive, AFS file space) before the instance is terminated.

Rockhopper can be downloaded to the home directory of a VCL instance and run from the command line - for some reason the GUI version would not save the file of assembled transcripts when I tested it. All six files of RNA-seq data are from the same accession of Arabidopsis, so they can all be concatenated into a single file and provided as input to Rockhopper.

A file of Arabidopsis thaliana RNA sequences (inferred from gene models in the TAIR 10 genome assembly: TAIR10.cDNA.fa.gz) is also available. The assembled transcripts can be compared with these predicted transcripts as a means of evaluating how good a job the Rockhopper assembler (which is designed for assembly of bacterial RNA-seq datasets) does with the plant RNA-seq data.

Exercise - Evaluating Assembly Quality¶

The rnaQUAST program is installed in the ‘bioinfo’ environment, and can be run using the Drosophila melanogaster reference genome assembly and annotation. Code to download the reference genome and annotation from Google Drive is provided in the Course Notes Google Doc, along with the commands needed to build a GMAP index of the reference genome assembly and then invoke rnaQUAST to assess the quality of a transcriptome assembly. If you saved the transcriptome assembly from your Trans-ABySS run in your AFS file space, you can use the file directly from that space rather than transferring it back to the home directory of your VCL instance.

Evaluating your transcriptome assembly outlines the use of Transrate and BUSCO to evaluate de-novo transcriptome assembly quality. This comes from documentation for a short course in transcriptome sequencing, assembly and analysis held at UC Davis in 2017. They install Transrate and BUSCO by compiling from source code - for the NC State HPC, it is much easier to install both packages in a conda virtual environment using the command:
```
conda create --prefix /share/bit815s20/$USER/txptomeQC_env transrate busco
```

Additional Resources¶

Raghavan et al, 2022 A simple guide to de novo transcriptome assembly and annotation. Briefings in Bioinformatics bbab563. Full Text A recent review of the entire process of de-novo transcriptome assembly, from data pre-processing to transcriptome annotation, with lists of programs suitable for different steps in the process. An excellent overview of the current state of the art.

The Harvard Informatics facility has an online guide to Best Practices for De Novo Transcriptome Assembly with Trinity that contains example SLURM scripts for submission of different job types to a computing cluster, as well as an extensive discussion of practical considerations in transcriptome assembly.

Advanced Guide to Trinity A resource dated 2014, so it may not be completely accurate with respect to the latest version of Trinity, but it discusses the three individual component programs of Trinity (Inchworm, Chrysalis, and Butterfly) and what each does during the process of Trinity assembly. This also provides some insight into specific parameter settings that can be adjusted to improve Trinity assemblies - these recommendations may or may not work with the latest version of Trinity, though.

Several papers have reported that the most reliable approach for transcriptome assembly for different organisms is to use multiple different programs for independent assemblies, followed by merging together of the resulting assembled contigs and selection of the most complete contigs as representatives for the final completed transcriptome.

Venturini et al, 2018 Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience 7(8):giy093 Full text

McManes, M.D. 2018 The Oyster River Protocol: a multi-assembler and kmer approach for de-novo transcriptome assembly. Peer J. 6:e5428. Full text This paper describes a set of criteria used to evaluate the relative quality of different transcriptome assemblies, using the software tools BUSCO, shmlast, Detonate, and Trans-Rate.

Nakasugi et al, 2014 Combining transcriptome assemblies from multiple de novo assemblers in the allo-tetraploid plant Nicotiana benthamiana. PLoS ONE 9(3): e91776. Full text

Gilbert, Donald 2013 Gene-omes built from mRNA seq not genome DNA. 7th annual arthropod genomics symposium. Notre Dame, Indiana. Poster

Correction of errors in RNA-seq reads requires consideration of the difference in relative abundance among transcripts in order to identify likely error-derived k-mers. Rcorrector is one software package capable of this process; the SEECER package described by Le et al (2013) is another.

Song & Florea, 2015. Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 4:48. Full text

Le et al, 2013 Probabilistic error correction for RNA sequencing. Nucleic Acids Res. 41(10):e109. Full text

One strategy for reducing the amount of RAM required for transcriptome assembly by the Trinity software package is to carry out “digital normalization” of the RNA-seq dataset - this means adjusting the numbers of reads in the dataset to ensure more uniform representation of both abundant and rare transcripts, while removing sequencing errors. A detailed exercise is available, which uses AWS cloud computing instances to provide sufficient computing power to process a real dataset.

Analysis of Next Generation Sequencing data (ANGUS) is a workshop series on high-throughput sequence data analysis; the 2017 workshop includes an exercise on transcriptome assembly with Trinity using cloud computing resources.

Hölzer M & Marz M. 2019. De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. Gigascience 8(5):giz039

Rana et al., 2016. Comparison of de-novo transcriptome assemblers and k-mer strategies using the killifish, Fundulus heteroclitus. PLoS One 11: e0153104. Full text

Boley et al., 2014. Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nature Biotechnology 32: 341-346. Publisher Website

Grabherr et al, 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29:644 - 652. PubMed Central

Tjaden, B. (2015) De novo genome assembly of bacterial transcriptomes from RNA-seq data. Genome Biology 16:1 Full text

Class Recordings¶

Session 13: recorded February 17th 2021 (this link is video and audio). A Transcript of recording of the video is also available.
Session 14: recorded February 19th 2021 (this link is video and audio). A Transcript of recording of the video is also available.
Session 15: recorded February 22nd 2021 (this link is video and audio). A Transcript of recording of the video is also available.

Last modified 5 February 2022. Edits by Ross Whetten, Will Kohlway, & Maria Adonay.