Comprehensive comparative analysis of strand specific rna sequencing methods

Journal Article

Table of Contents Show

Introduction
htseq-count and strandedness
Conclusions
What is strand specific RNA sequencing?
How do you analyze data in RNA sequencing?
What are RNA
What method can be used to compare the transcriptomes of individual single cells?

Krishna A Srinivasan,

Search for other works by this author on:

Suman K Virdee,

Search for other works by this author on:

Andrew G McArthur

Corresponding author: A.G. McArthur, Department of Biochemistry and Biomedical Sciences, DeGroote School of Medicine, McMaster University, Hamilton, Ontario L8S 4K1, Canada. Tel.: +1-905-525-9140 ×21663; Fax:

+1-905-528-5330

; E-mail:

Search for other works by this author on:

Received:

07 January 2020

Revision received:

20 March 2020

PDF
Split View
- Article contents
- Figures & tables
- Video
- Audio
- Supplementary Data
Cite
Cite
Krishna A Srinivasan, Suman K Virdee, Andrew G McArthur, Strandedness during cDNA synthesis, the stranded parameter in htseq-count and analysis of RNA-Seq data, Briefings in Functional Genomics, Volume 19, Issue 5-6, September-November 2020, Pages 339–342, https://doi.org/10.1093/bfgp/elaa010
Close
- Email
- Twitter
- Facebook
- More

Navbar Search Filter Microsite Search Term Search

Abstract

RNA sequencing (RNA-Seq) is a complicated protocol, both in the laboratory in generation of data and at the computer in analysis of results. Several decisions during RNA-Seq library construction have important implications for analysis, most notably strandedness during complementary DNA library construction. Here, we clarify bioinformatic decisions related to strandedness in both alignment of DNA sequencing reads to reference genomes and subsequent determination of transcript abundance.

Introduction

RNA sequencing (RNA-Seq) is a powerful method for quantifying the transcriptome and identifying differential gene expression [1–3]. RNA-Seq produces a large amount of data; therefore, software packages designed for data analysis are vital for the advancement of the field (Figure 1). Based on our work with RNA-Seq of zebrafish embryos and human cell lines, we identify a few key aspects of RNA-Seq analysis not consistently described in the literature.

Figure 1

A standard RNA-Seq data analysis pipeline. FASTQC [9] is the first tool in the pipeline and involves checking the quality of raw reads. Trimmomatic [10], an optional step, may be used to trim low-quality bases if read quality is lower than a specified threshold. HISAT2 [11] is used to align reads to a reference genome. Compared with other read aligners, HISAT2 is more capable of dealing with reads that incorrectly map to pseudogenes. After alignment, htseq-count [4] identifies the number of reads that map to genes. DESeq2 [12] uses the output of htseq-count to identify differentially expressed genes.

htseq-count and strandedness

Anders and colleagues developed HTSeq [4], a python framework for analyzing high throughput sequencing data, within which htseq-count is located. htseq-count is a commonly used tool in many data analysis pipelines [3,5,6], which counts the number of sequencing reads mapping to genes after the read alignment step. Certain parameters in htseq-count are dependent upon the laboratory methods used; in particular, the ‘stranded’ parameter is dependent upon the complementary DNA (cDNA) library construction methodology. The three possible settings are ‘Yes,’ ‘No’ and ‘Reverse’ and refer to strandedness during library construction. Strand-specific protocols involve preserving information on which strand the messenger RNA (mRNA) originated, while this information is lost in un-stranded protocols (Figure 2).

Figure 2

First strand, second strand and un-stranded cDNA library protocols. dUTP = deoxyuridine triphosphate, PCR = polymerase chain reaction, R1 = Read 1 (first read produced), R2 = Read 2 (second read produced).

For first strand synthesis, strand information is preserved by degrading the non-template strand and PCR amplifying only the template cDNA strand during library construction (the ‘Reverse’ option in htseq-count). Second strand protocols preserve strand information by using adapters to mark the template strand (the ‘Yes’ option in htseq-count, which is the default). When aligning reads from bi-directional Illumina sequencing of a first strand library, the second read produced will map to the DNA strand containing the coding sequence while the first read will map to the opposite strand. For a second strand cDNA library, the first read produced will map to the DNA strand containing the coding sequence while the second read will map to the opposite strand. The default ‘Yes’ stranded option in htseq-count may be incorrectly interpreted merely as a confirmation of having strand-specific data, instead of specifically indicating data from second strand library. Moreover, the ht-seq documentation does not explicitly state which stranded option to use based on cDNA library construction. Using stranded ‘Yes’ on a first strand cDNA library results in all read pairs that map to the reference genome in the correct orientation being discarded and reads that map in the incorrect orientation being counted as legitimate reads. Essentially, only antisense data are considered for transcript abundance measures. First strand cDNA libraries thus require the use of the stranded ‘Reverse’ option (Figure 3). A scatter plot of stranded ‘Reverse’ against ‘Yes’ read counts may be useful if cDNA library construction in unknown. Stranded ‘Reverse’ would be the correct option if the majority of the genes occur in a steep vertical line, while stranded ‘Yes’ would be correct if the majority of the genes were oriented in a horizontal line.

Figure 3

htseq-count read counts for zebrafish genes using stranded ‘Reverse’ versus stranded ‘Yes’ with a first strand cDNA library, with each point representing a single gene. Reads were aligned to the zebrafish genome (GRCz10) using HISAT2 (version 2.0.5.2) and subsequently quantified using htseq-count (version 0.6.1). Stranded ‘Reverse’ produces larger counts for the majority of genes. In contrast, data is being unnecessarily removed using the ‘Yes’ option. Approximately 700 000 read pairs were mapped to genes using stranded ‘Yes,’ while stranded ‘Reverse’ had 21 million read pairs, using an initial input of 25 million reads pairs.

Conclusions

Second strand cDNA libraries may have been more common when htseq-count was designed than now, resulting in the default ‘Yes’ option. However, first strand cDNA libraries with dUTP second strand marking, such as the NEBNext Ultra Directional RNA Prep Kit (New England Biolabs, Ipswich, MA, USA), have been shown to be one of the most reliable and accurate techniques for RNA-seq [7] and require the ‘Reverse’ option. We also note that the upstream read alignment step performed by HISAT2 includes the ‘rna-strandness’ parameter (which add tags in the alignment file for read strandedness), as well the ‘nofw’ and ‘norc’ parameters (restricting read alignments to specific strands). The defaults are unstranded and no strand-based restrictions upon read alignment, respectively, but alternative settings could restrict the creation of read alignments for interpretation by htseq-count. The language used differs, where unstranded in HISAT2 corresponds to htseq-count’s ‘No,’ HISAT2’s forward corresponds to htseq-count’s ‘Yes’ (i.e. second strand cDNA libraries) and HISAT2’s reverse corresponds to htseq-count’s ‘Reverse’ (i.e. first strand cDNA libraries). HISAT2’s precurser Tophat [8] used FR-unstranded for ‘No,’ FR-second strand for ‘Yes’ and FR-first strand for ‘Reverse.’ As with any bioinformatics software, we recommend that users be aware of the relationship between software parameters and laboratory techniques, and use a stranded option that matches their cDNA library protocols to prevent unnecessary errors.

(i) Language describing parameters for handling strandedness in popular RNA-Seq bioinformatics tools is unclear, leading to erroneous measurement of antisense transcript abundance instead of the sense transcriptome when using default parameters.
(ii) First strand cDNA libraries require the use of the stranded ‘Reverse’ option for RNA-Seq workflow using the htseq-count tool.

Key Points

Strandedness during cDNA)library construction has important implications for RNA-Seq analysis.
First strand cDNA libraries with dUTP second strand marking are now commonly used for RNA-seq.

Funding

A.G.M. held a Cisco Research Chair in Bioinformatics, supported by Cisco Systems Canada, Inc.

Andrew G. McArthur is an associate professor at McMaster University with a research program focussed on bioinformatic approaches to antimicrobial resistance and environmental toxicology. K.A.S. and S.K.V. are recent graduates of the McArthur laboratory.

References

Mortazavi

Williams

McCue

, et al.

Mapping and quantifying mammalian transcriptomes by RNA-Seq

Nat Methods

2008

;

621

–

628

Wang

Gerstein

Snyder

RNA-Seq: a revolutionary tool for transcriptomics

Nat Rev Genet

2009

;

–

Conesa

Madrigal

Tarazona

, et al.

A survey of best practices for RNA-seq data analysis

Genome Biol

2016

;

–

Anders

Pyl

Huber

HTSeq—a python framework to work with high-throughput sequencing data

Bioinformatics

2015

;

166

–

169

Fonseca

Marioni

Brazma

RNA-Seq gene profiling—a systematic empirical comparison

PLoS One

2014

;

e107026

P-Y

Phan

Wang

Assessing the impact of human genome annotation choice on RNA-seq expression estimates

BMC Bioinformatics

2013

;

(

Suppl 11

–

S13

Levin

Yassour

Adiconis

, et al.

Comprehensive comparative analysis of strand-specific RNA sequencing methods

Nat Methods

2010

;

709

–

715

Kim

Pertea

Trapnell

, et al.

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Genome Biol

2013

;

R36

–

R13

Blankenberg

Gordon

Kuster Von

, et al.

Manipulation of FASTQ data with galaxy

Bioinformatics

2010

;

1783

–

1785

10.

Bolger

Lohse

Usadel

Trimmomatic: a flexible trimmer for illumina sequence data

Bioinformatics

2014

;

2114

–

2120

11.

Kim

Langmead

Salzberg

HISAT: a fast spliced aligner with low memory requirements

Nat Methods

2015

;

357

–

360

12.

Love

Huber

Anders

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Genome Biol

2014

;

550

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

What is strand specific RNA sequencing?

RNA-Seq is a recently developed sequencing technology, that through the analysis of cDNA allows for unique insights into the transcriptome of a cell. The data generated by RNA-Seq provides information on gene expression, alternative splicing events and the presence of non-coding RNAs.

How do you analyze data in RNA sequencing?

For most RNA‐seq studies, the data analyses consist of the following key steps [5, 6]: (1) quality check and preprocessing of raw sequence reads, (2) mapping reads to a reference genome or transcriptome, (3) counting reads mapped to individual genes or transcripts, (4) identification of differential expression (DE) ...

What are RNA

RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.

What method can be used to compare the transcriptomes of individual single cells?

Introduction to Single-Cell RNA Sequencing Single-cell sequencing is a next-generation sequencing (NGS) method that examines the genomes or transcriptomes of individual cells, providing a high-resolution view of cell-to-cell variation.