Readings and Resources

NC State Bioinformatics Users Group (BUG)

Course Notes

  • Spring 2022 Google Doc course notes An online version of notes for each class session. Students have view permission; instructors have edit access. This provides a way for instructors to write code that students can copy into a terminal window, and provides a place to record information about class sessions.
  • Spring 2021 Google Doc course notes - these may be of historical interest, or not.

Global overview books and papers

  • The Biostar Handbook - Bioinformatics Data Analysis Guide. Istvan Albert and others. Available online
  • Next generation quantitative genetics in plants. Jiménez-Gómez, Frontiers in Plant Science 2:77, 2011 Full Text [Equally relevant to animal and microbial systems]
  • Sense from sequence reads: methods for alignment and assembly. Flicek & Birney, Nat Methods 6(11 Suppl):S6-S12, 2009. Full Text

Data Management and Project Organization

  • The FAIR Guiding Principles for scientific data management and stewardship. Wilkinson et al, Sci Data 3:160018, 2016 Full Text Research data should be Findable, Accessible, Interoperable, and Reusable in order to be of maximum value to the larger scientific community.
  • A quick guide to organizing computational biology projects. Noble, PLoS Comp Biol 5:e1000424, 2009 Full Text Written from the perspective of the research scientist generating and analyzing the data
  • Ten simple rules for providing effective bioinformatics research support. Kumuthini, et al. PLoS Comp Biol 16:e1007531, 2020. Full Text Written from the perspective of core bioinformatics facility service providers
  • Good enough practices in scientific computing. Wilson et al, PLoS Comp Biol 13:e1005510, 2017. Full Text Covers data management, software management, collaboration, project organization, version control, and manuscript authoring practices.
  • PM4NGS, a project management framework for next-generation sequencing data analysis. Vera Alvarez et al, GigaScience 10:giaa141, 2021 Full Text A recent publication, as-yet uncited, that describes an automated system for creating a management structure of directories, files, and data management tools based on Jupyter notebooks and Common Workflow Language (CWL). Not everyone is a fan of CWL, but the paper presents useful concepts regarding strategies for reproducible research and data management to achieve the FAIR principles.

Library construction and experimental design

  • Statistical design and analysis of RNA sequencing data. Auer & Doerge, Genetics 185(2):405-16, 2010. PubMedCentral
  • Biases in Illumina transcriptome sequencing caused by random hexamer priming. Hansen et al., Nucleic Acids Res. 38(12): e131, 2010. PubMedCentral
  • Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Aird et al, Genome Biology 12:R18, 2011 Full Text
  • Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes. Kozarewa et al, Nature Methods 6(4):291-295, 2009 PubMedCentral
  • Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Rohland & Reich, Genome Research 22(5): 939–946, 2012. PubMedCentral
  • Predicting the molecular complexity of sequencing libraries. Daley & Smith, Nature Methods 10(4):325-327, 2013 PubMedCentral
  • RNA-seq differential expression studies: more sequence or more replication? Liu et al., Bioinformatics 30: 301 - 304, 2014. Publisher Web Site
  • Power analysis and sample size estimation for RNA-seq differential expression. Ching et al., RNA 20: 1684 - 1696, 2014. Publisher Web Site
  • Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Ballouz et al., Bioinformatics 31: 2123 - 2130, 2015. Publisher Web Site
  • Points of significance: replication. Blainey et al., Nature Methods 11: 879–880, 2014. Publisher Web Site
  • Points of Significance: Nested designs. Krzywinski et al., Nature Methods 11: 977–978, 2014 Publisher Web Site
  • Points of significance: Sources of variation. Altman & Krzywinski, Nature Methods 12: 5–6, 2015 Publisher Web Site
  • Compilation of DNA sequencing library preparation Methods: as a poster & an extensive methods review PDF.
  • Compilation of RNA sequencing library preparation Methods: as a poster & an extensive methods review PDF.
  • A poster compiling Single-Cell sequencing methods.
  • An overview of recent publications for cell biology and complex disease research with Illumina technology.

Data formats and alignment software tools

  • The Sequence Alignment/Map format and SAMtools. Li et al, Bioinformatics 25(16):2078-9, 2009 PubMedCentral
  • SAM format specification file
  • PAF alignment format is described in the manual page for the minimap2 long-read aligner.
  • Minimap2: pairwise alignment for nucleotide sequences. Li, Bioinformatics 34:3094-3100, 2018 https://doi.org/10.1093/bioinformatics/bty191. PubMedCentral
  • Efficient storage of high throughput sequencing data using reference-based compression. Fritz et al, Genome Res 21(5):734-40, 2011. Full Text
  • Compression of DNA sequence reads in FASTQ format. Deorowicz & Grabowski, Bioinformatics 27(6):860-2, 2011. PubMed
  • Fast and accurate short read alignment with Burrows-Wheeler transform. Li & Durbin, Bioinformatics 25(14):1754-60, 2009. PubMedCentral
  • Improving SNP discovery by base alignment quality. Li H, Bioinformatics 27(8):1157-8, 2011. PubMed
  • BEDTools: a flexible suite of utilities for comparing genomic features. Quinlan and Hall, Bioinformatics 26:841-842, 2010. Publisher Website
  • The variant call format and VCFtools. Danecek et al, Bioinformatics 27:2156-2158, 2011. PubMedCentral
  • The UC Santa Cruz Genome Browser FAQ on data file formats

Data quality assessment, filtering, and correction

  • HTQC: a fast quality control toolkit for Illumina sequencing data. Yang et al, BMC Bioinformatics 14:33, 2013. PubMed
  • FastQC: a quality control tool for high-throughput sequence data. Home Page
  • FASTX-toolkit: FASTQ/A short-reads pre-processing tools Home Page
  • QuorUM: an error corrector for Illumina reads. Marçais et al., 2013 arXiv preprint or 2015 PLoS One paper
  • Quake: quality-aware detection and correction of sequencing errors. Kelley et al, Genome Biol 11(11):R116, 2010. PubMed
  • Reference-free validation of short read data. Schröder et al, PLoS One 5(9):e12681, 2010. PubMedCentral
  • Correction of sequencing errors in a mixed set of reads. Salmela, Bioinformatics 26(10):1284, 2010. Full Text [Includes error correction of SOLiD reads in colorspace.]
  • Repeat-aware modeling and correction of short read errors. Yang et al, BMC Bioinformatics 12(Suppl 1):S52, 2011 PubMedCentral [Requires a reference sequence.]
  • HiTEC: accurate error correction in high-throughput sequencing data. Ilie et al, Bioinformatics 27(3):295, 2011 Full Text
  • Error correction of high-throughput sequencing datasets with non-uniform coverage. Medvedev et al., Bioinformatics 27(13):i137-41, 2011. PubMedCentral
  • Characterization of the Conus bullatus genome and its venom-duct transcriptome. Hu et al., BMC Genomics 12:60, 2011 Full Text [Includes a novel strategy for estimating genome size from a partial transcriptome assembly and low-coverage (3x) genome sequence.]

De novo assembly

  • Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Zerbino & Birney, Genome Res 18(5):821-9, 2008. PubMedCentral
  • Assembly of large genomes using second-generation sequencing. Schatz et al, Genome Res 20(9):1165-73, 2010. PubMedCentral
  • High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Gnerre et al, PNAS 108(4): 1513-18, 2011 PubMedCentral
  • Genome assembly has a major impact on gene content: a comparison of annotation in two Bos taurus assemblies. Florea et al., PLoS One 6(6):e21400, 2011. PubMedCentral
  • Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Carver et al, Bioinformatics 28(4):464 - 469, 2012 PubMedCentral
  • Efficient de novo assembly of large genomes using compressed data structures. Simpson & Durbin, Genome Research 22:549-556, 2012 Full Text [Describes the String Graph Assembler (SGA), which assembled a human genome in less than 6 days using 54 Gb of RAM and a 123-processor compute cluster for calculation of an FM-index of the 1.2 billion reads]
  • Readjoiner: a fast and memory efficient string graph-based sequence assembler. Gonnella & Kurtz, BMC Bioinformatics 13: 82, 2012 PubMedCentral
  • Assemblathon 1: A competitive assessment of de novo short read assembly methods. Earl et al, Genome Research 21:2224-2241, 2011 Full Text

Chromatin analysis

Bias Correction

  • Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Meyer and Liu, Nat Rev Genetics 15: 709 - 721, 2014 Publisher Web Site

Chromatin Immunoprecipitation sequencing: ChIP-seq

  • ChIP-seq: advantages and challenges of a maturing technology. Park, Nat Rev Genet. 10:669-80, 2009 PubMed
  • ChIP-seq and Beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Furey, Nat Rev Genet 13: 840–852, 2012 Publisher Web Site
  • MuMoD: a Bayesian approach to detect multiple modes of protein–DNA binding from genome-wide ChIP data. Narlikar, Nucleic Acids Res 41:21–32, 2013 PubMed

Chromatin conformation

  • A decade of 3C technologies: insights into nuclear organization. de Wit & de Laat, Genes & Devel 26: 11-24, 2012 Publisher Website
  • Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Dekker et al, Nature Reviews Genetics 14: 390–403, 2013 Publisher Website

Transcriptome analysis

General considerations for RNA-seq library construction

  • Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Fu et al, PNAS 111:1891–1896, 2014 Publisher Web Site

Assembly and comparison to genome

  • A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads. Mbandi et al., Frontiers in Genetics 2014. Publisher Website
  • Full-length transcriptome assembly from RNA-Seq data without a reference genome. Grabherr et al, Nature Biotechnology 29:644 - 652, 2011. PubMed The software, called Trinity, is available on GitHub.
  • Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Peng et al, Nature Biotechnology 30:253 - 260, 2012. PubMed Several comments on this paper question whether the reported differences are in fact evidence of editing or are simply sequencing errors - the authors stand by their conclusions, but the controversy demonstrates the importance of robust data analysis methods.
  • Optimization of de novo transcriptome assembly from next-generation sequencing data. Surget-Groba & Montoya-Burgos, Genome Res 20(10):1432-40, 2010. Full Text
  • Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. Martin et al, BMC Genomics 11:663, 2010 Full Text
  • De novo assembly and analysis of RNA-seq data. Robertson et al, Nature Methods 7:909-912, 2010 Full Text Describes Trans-ABySS, a pipeline to use the ABySS parallel assembler for de novo transcriptome analysis.

Differential expression analysis

  • Robust adjustment of sequence tag abundance. Baumann & Doerge, Bioinformatics 2013 PubMed
  • R-SAP: a multi-threading computational pipeline for the characterization of high-throughput RNA-sequencing data. Mittal & McDonald, Nucleic Acids Res, 2012 Full Text
  • Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Mercer et al, Nature Biotechnology 30:99 - 104, 2012 Publisher Website
  • Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks. Trapnell et al, Nature Protocols 7:562 - 578, 2012 Publisher Website
  • Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Łabaj et al, Bioinformatics 27:i383 - i391, 2011 Full Text
  • Improving RNA-Seq expression estimates by correcting for fragment bias. Roberts et al, Genome Biol 12:R22, 2011 PubMed Central
  • Cloud-scale RNA-sequencing differential expression analysis with Myrna. Langmead et al, Genome Biol 11:R83, 2010 Full Text
  • From RNA-seq reads to differential expression results. Oshlack et al, Genome Biol 11(12):220, 2010 Full Text
  • DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Wang et al., Bioinformatics. 26(1):136-8. 2010 PubMed
  • DESeq: Differential expression analysis for sequence count data. Anders and Huber, Genome Biology 11:R106, 2010 Full Text
  • Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Love et al, BioRxiv doi: 10.1101/002832, 2014 Full Text
  • edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Robinson et al., Bioinformatics 26(1):139-40 2010 PubMedCentral
  • Two-stage Poisson model for testing RNA-seq data. Auer and Doerge, SAGMB 10(1), article 26 Full Text
  • Experimental design, preprocessing, normalization and differential expression analysis of small RNA sequencing experiments. McCormick et al., Silence 2(1):2, 2011 PubMedCentral
  • RNA-Seq gene expression estimation with read mapping uncertainty. Li et al, Bioinformatics 26:493-500, 2010 PubMedCentral [Describes the RSEM software package.]

Comparing genomes and assemblies; variant detection

  • Toward better understanding of artifacts in variant calling from high-coverage samples. Li, Bioinformatics 30:2843, 2014 PubMedCentral
  • Versatile and open software for comparing large genomes. Kurtz et al, Genome Biol 5(2):R12, 2004. PubMedCentral [Describes the MUMmer software for full-genome alignment & comparisons.]
  • Searching for SNPs with cloud computing. Langmead et al, Genome Biol 10(11):R134, 2009 Full Text
  • Calling SNPs without a reference sequence. Ratan et al, BMC Bioinformatics 11:130, 2010 PubMedCentral
  • Microindel detection in short-read sequence data. Krawitz et al, Bioinformatics 26(6):722-9, 2010. Full Text
  • vipR: variant identification in pooled DNA using R. Altmann et al., Bioinformatics 27: i77-i84, 2011. PubMedCentral
  • Geoseq: a tool for dissecting deep-sequencing datasets. Gurtowski et al, BMC Bioinformatics 11:506, 2010. PubMedCentral [Geoseq is a web service that allows searching deep sequencing datasets with a reference sequence of a gene of interest.]
  • Detecting and annotating genetic variations using the HugeSeq pipeline. Lam et al, Nature Biotechnology 30:226 - 229, 2012 Publisher Website, Home Page
  • Genome-wide LORE1 retrotransposon mutagenesis and high-throughput insertion detection in Lotus japonicus. Urbański et al, Plant J 69:731-741, 2012. Publisher Website [This paper describes a 2-dimensional pooling strategy with barcoding to allow use of Illumina sequencing to screen for retrotransposon insertion mutations, and includes a software package called FSTpoolit for analysis of the resulting sequence reads.]
  • Reproducibility of variant calls in replicate next-generation sequencing experiments. Qi et al., PLoS One 10: e0119230, 2015 Full Text

Genotyping by sequencing

  • Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Davey et al., Nat Rev Genet 12(7):499-510, 2011 PubMed [A review of methods available at the time.]
  • A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. Elshire et al., PLoS One 6(5):e19379, 2011. Full Text
  • Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. Poland et al., PLoS One 7(2): e32253, 2012. Full Text
  • Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. Peterson et al, PLoS One 7(5):e37135, 2012. Full Text
  • Imputation of unordered markers and the impact on genomic selection accuracy. Rutkowski et al, G3 3(3):427-39, 2013. Full Text
  • Diversity Arrays Technology (DArT) and next-generation sequencing combined: genome-wide, high-throughput, highly informative genotyping for molecular breeding of Eucalyptus. Sansaloni et al., BMC Proceedings 5(Suppl 7):P54, 2011 Full Text
  • High-throughput genotyping by whole-genome resequencing. Huang et al., Genome Res 19(6):1068-76, 2009. Full Text
  • Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Andolfatto et al. Genome Res 21(4):610-7, 2011. Full Text

Restriction-site Associated DNA (RAD) markers

  • Rapid SNP discovery and genetic mapping using sequenced RAD markers. Baird et al, PLoS One 3(10):e3376, 2008 Full Text
  • Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. Baxter et al., PLoS One 6(4):e19315, 2011. Full Text
  • Genome evolution and meiotic maps by massively parallel DNA sequencing: spotted gar, an outgroup for the teleost genome duplication. Amores et al, Genetics 188(4):799-808, 2011. PubMed
  • Construction and application for QTL analysis of a Restriction-site Associated DNA (RAD) linkage map in barley. Chutimanitsakun et al, BMC Genomics 12:4, 2011. Full Text
  • RAD tag sequencing as a source of SNP markers in Cynara cardunculus L. Scaglione et al., BMC Genomics 13:3, 2012. Full Text
  • Paired-end RAD-seq for de novo assembly and marker design without available reference. Willing et al., Bioinformatics 27(16):2187-93, 2011. Publisher Website
  • Local de novo assembly of RAD paired-end contigs using short sequencing reads. Etter et al., PLOS ONE 6(4): e18561, 2011. Full Text
  • Stacks: building and genotyping loci de novo from short-read sequences. Catchen et al., G3: Genes, Genomes, Genetics, 1:171-182, 2011. Home Page
  • Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Chong et al, Bioinformatics 28(21):2732-7, 2012. Publisher Website
  • UK RAD Sequencing Wiki page, with bibliography and RADTools software download Home Page

Population Genomics

  • PGDspider: an automated data conversion tool for connecting population genetics and genomics programs. Lischer & Excoffier, Bioinformatics 28: 298-299, 2012 Publisher Website

Workspace environments

Papers

  • Using prototyping to choose a bioinformatics workflow management system. Jackson et al, PLoS Comput Biol. 17:e1008622, 2021. Full Text A description of how the authors compared four different workflow management systems for their analytical pipeline development project before choosing Nextflow
  • Nextflow enables reproducible computational workflows. Di Tommaso et al, Nat Biotechnol 35:316-319, 2017 Publisher Website The paper describing the Nextflow workflow management system - this does not provide much guidance for how to use Nextflow; for that information see the Nextflow documentation
  • Singularity: Scientific containers for mobility of compute. Kurtzer et al, PLoS ONE 12(5): e0177459, 2017. Full Text Containers are an important aspect of reproducible research, and Singularity is specifically designed to be compatible with use in cluster-computing environments and interoperable with Docker. Information on installing and using Singularity is available in the documentation.
  • Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goecks et al, Genome Biol 11(8):R86, 2010 PubMedCentral
  • Galaxy Cloudman: delivering cloud compute clusters. Afgan et al, BMC Bioinformatics 11(Suppl 12):S4, 2010 Full Text
  • The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. McKenna et al, Genome Res 20(9):1297-303, 2010. PubMedCentral
  • A framework for variation discovery and genotyping using next-generation DNA sequencing data. DePristo et al., Nat Genet 43(5):491-8, 2011. PubMed

Online resources

Manuals and contributed documentation for R are available at the R-project.org website, and video tutorials are also available on YouTube; those posted by Tutorlol are brief, clear, and to the point.

Materials from a series of mini-courses in R taught in 2010 at UCLA are available.

A Little Book of R for Bioinformatics is an on-line resource with information and exercises to provide practice in bioinformatics analysis of DNA sequences and other biological data in R. Many books on specific topics in R programming are also available through Amazon or other vendors.

Cloud computing resources

  • The case for cloud computing in genome informatics. Stein, Genome Biol 11(5):207, 2010 PubMed
  • Galaxy Cloudman: delivering cloud compute clusters. Afgan et al, BMC Bioinformatics 11(Suppl 12):S4, 2010 Full Text
  • CloudBioLinux is an open-source project that provides a bioinformatics Linux system for cloud computing, pre-configured with a variety of software tools installed and ready to use.
  • A tutorial on getting started with CloudBioLinux on the Amazon Web Services Elastic Compute Cloud (EC2)
  • Deploying Galaxy on the Cloud slides from a presentation by Enis Afgan (Emory University) at the Bioinformatics Open Source Conference in Boston, July 2010
  • A screencast that provides a step-by-step guide to starting a Galaxy cluster in the EC2 environment
  • A webpage that has the same information in text form, and is the basis for the screencast
  • The iPlant Collaborative, an NSF-funded project to create computational resources for plant biology research, provides access to cloud computing resources through Atmosphere
  • SeqWare Query Engine: storing and searching sequence data in the cloud. O’Connor et al, BMC Bioinformatics 11(Suppl 12):S2, 2010 Full Text
  • An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. Taylor, BMC Bioinformatics 11(Suppl 12):S1, 2010 Full Text

Links to Linux command-line tutorials and resources

  • Data Science at the Command Line A free online book by Jeroen Janssens, also available in hard copy form from O’Reilly Media. The second edition was published in 2021, updating the first edition published in 2014. The Preface has a nice explanation of the author’s motivation and rationale for writing the book.
  • The Linux Command Line by William Shotts. Another free online book, also available as a free PDF download or in print form from No Starch Press
  • Tutorials for AWK, a powerful tool for handling data tables
  • Tutorials for bash shell scripting
  • Tutorials for sed, the command-line stream editor