Comprehensive comparative analysis of strand specific rna sequencing methods

Journal Article

Krishna A Srinivasan,

Search for other works by this author on:

Suman K Virdee,

Search for other works by this author on:

Andrew G McArthur

Corresponding author: A.G. McArthur, Department of Biochemistry and Biomedical Sciences, DeGroote School of Medicine, McMaster University, Hamilton, Ontario L8S 4K1, Canada. Tel.: +1-905-525-9140 ×21663; Fax:

+1-905-528-5330

; E-mail:

Search for other works by this author on:

Received:

07 January 2020

Revision received:

20 March 2020

  • PDF
  • Split View
    • Article contents
    • Figures & tables
    • Video
    • Audio
    • Supplementary Data
  • Cite

    Cite

    Krishna A Srinivasan, Suman K Virdee, Andrew G McArthur, Strandedness during cDNA synthesis, the stranded parameter in htseq-count and analysis of RNA-Seq data, Briefings in Functional Genomics, Volume 19, Issue 5-6, September-November 2020, Pages 339–342, https://doi.org/10.1093/bfgp/elaa010

    Close

    • Email
    • Twitter
    • Facebook
    • More

Close

Navbar Search Filter Microsite Search Term Search

Abstract

RNA sequencing (RNA-Seq) is a complicated protocol, both in the laboratory in generation of data and at the computer in analysis of results. Several decisions during RNA-Seq library construction have important implications for analysis, most notably strandedness during complementary DNA library construction. Here, we clarify bioinformatic decisions related to strandedness in both alignment of DNA sequencing reads to reference genomes and subsequent determination of transcript abundance.

Introduction

RNA sequencing (RNA-Seq) is a powerful method for quantifying the transcriptome and identifying differential gene expression [1–3]. RNA-Seq produces a large amount of data; therefore, software packages designed for data analysis are vital for the advancement of the field (Figure 1). Based on our work with RNA-Seq of zebrafish embryos and human cell lines, we identify a few key aspects of RNA-Seq analysis not consistently described in the literature.

Figure 1

Comprehensive comparative analysis of strand specific rna sequencing methods

A standard RNA-Seq data analysis pipeline. FASTQC [9] is the first tool in the pipeline and involves checking the quality of raw reads. Trimmomatic [10], an optional step, may be used to trim low-quality bases if read quality is lower than a specified threshold. HISAT2 [11] is used to align reads to a reference genome. Compared with other read aligners, HISAT2 is more capable of dealing with reads that incorrectly map to pseudogenes. After alignment, htseq-count [4] identifies the number of reads that map to genes. DESeq2 [12] uses the output of htseq-count to identify differentially expressed genes.

htseq-count and strandedness

Anders and colleagues developed HTSeq [4], a python framework for analyzing high throughput sequencing data, within which htseq-count is located. htseq-count is a commonly used tool in many data analysis pipelines [3,5,6], which counts the number of sequencing reads mapping to genes after the read alignment step. Certain parameters in htseq-count are dependent upon the laboratory methods used; in particular, the ‘stranded’ parameter is dependent upon the complementary DNA (cDNA) library construction methodology. The three possible settings are ‘Yes,’ ‘No’ and ‘Reverse’ and refer to strandedness during library construction. Strand-specific protocols involve preserving information on which strand the messenger RNA (mRNA) originated, while this information is lost in un-stranded protocols (Figure 2).

Figure 2

Comprehensive comparative analysis of strand specific rna sequencing methods

First strand, second strand and un-stranded cDNA library protocols. dUTP = deoxyuridine triphosphate, PCR = polymerase chain reaction, R1 = Read 1 (first read produced), R2 = Read 2 (second read produced).

For first strand synthesis, strand information is preserved by degrading the non-template strand and PCR amplifying only the template cDNA strand during library construction (the ‘Reverse’ option in htseq-count). Second strand protocols preserve strand information by using adapters to mark the template strand (the ‘Yes’ option in htseq-count, which is the default). When aligning reads from bi-directional Illumina sequencing of a first strand library, the second read produced will map to the DNA strand containing the coding sequence while the first read will map to the opposite strand. For a second strand cDNA library, the first read produced will map to the DNA strand containing the coding sequence while the second read will map to the opposite strand. The default ‘Yes’ stranded option in htseq-count may be incorrectly interpreted merely as a confirmation of having strand-specific data, instead of specifically indicating data from second strand library. Moreover, the ht-seq documentation does not explicitly state which stranded option to use based on cDNA library construction. Using stranded ‘Yes’ on a first strand cDNA library results in all read pairs that map to the reference genome in the correct orientation being discarded and reads that map in the incorrect orientation being counted as legitimate reads. Essentially, only antisense data are considered for transcript abundance measures. First strand cDNA libraries thus require the use of the stranded ‘Reverse’ option (Figure 3). A scatter plot of stranded ‘Reverse’ against ‘Yes’ read counts may be useful if cDNA library construction in unknown. Stranded ‘Reverse’ would be the correct option if the majority of the genes occur in a steep vertical line, while stranded ‘Yes’ would be correct if the majority of the genes were oriented in a horizontal line.

Figure 3

Comprehensive comparative analysis of strand specific rna sequencing methods

htseq-count read counts for zebrafish genes using stranded ‘Reverse’ versus stranded ‘Yes’ with a first strand cDNA library, with each point representing a single gene. Reads were aligned to the zebrafish genome (GRCz10) using HISAT2 (version 2.0.5.2) and subsequently quantified using htseq-count (version 0.6.1). Stranded ‘Reverse’ produces larger counts for the majority of genes. In contrast, data is being unnecessarily removed using the ‘Yes’ option. Approximately 700 000 read pairs were mapped to genes using stranded ‘Yes,’ while stranded ‘Reverse’ had 21 million read pairs, using an initial input of 25 million reads pairs.

Conclusions

Second strand cDNA libraries may have been more common when htseq-count was designed than now, resulting in the default ‘Yes’ option. However, first strand cDNA libraries with dUTP second strand marking, such as the NEBNext Ultra Directional RNA Prep Kit (New England Biolabs, Ipswich, MA, USA), have been shown to be one of the most reliable and accurate techniques for RNA-seq [7] and require the ‘Reverse’ option. We also note that the upstream read alignment step performed by HISAT2 includes the ‘rna-strandness’ parameter (which add tags in the alignment file for read strandedness), as well the ‘nofw’ and ‘norc’ parameters (restricting read alignments to specific strands). The defaults are unstranded and no strand-based restrictions upon read alignment, respectively, but alternative settings could restrict the creation of read alignments for interpretation by htseq-count. The language used differs, where unstranded in HISAT2 corresponds to htseq-count’s ‘No,’ HISAT2’s forward corresponds to htseq-count’s ‘Yes’ (i.e. second strand cDNA libraries) and HISAT2’s reverse corresponds to htseq-count’s ‘Reverse’ (i.e. first strand cDNA libraries). HISAT2’s precurser Tophat [8] used FR-unstranded for ‘No,’ FR-second strand for ‘Yes’ and FR-first strand for ‘Reverse.’ As with any bioinformatics software, we recommend that users be aware of the relationship between software parameters and laboratory techniques, and use a stranded option that matches their cDNA library protocols to prevent unnecessary errors.

  • (i) Language describing parameters for handling strandedness in popular RNA-Seq bioinformatics tools is unclear, leading to erroneous measurement of antisense transcript abundance instead of the sense transcriptome when using default parameters.

  • (ii) First strand cDNA libraries require the use of the stranded ‘Reverse’ option for RNA-Seq workflow using the htseq-count tool.

Key Points

  • Strandedness during cDNA)library construction has important implications for RNA-Seq analysis.

  • First strand cDNA libraries with dUTP second strand marking are now commonly used for RNA-seq.

Funding

A.G.M. held a Cisco Research Chair in Bioinformatics, supported by Cisco Systems Canada, Inc.

Andrew G. McArthur is an associate professor at McMaster University with a research program focussed on bioinformatic approaches to antimicrobial resistance and environmental toxicology. K.A.S. and S.K.V. are recent graduates of the McArthur laboratory.

References

1.

Mortazavi

 

A

,

Williams

 

BA

,

McCue

 

K

, et al.   

Mapping and quantifying mammalian transcriptomes by RNA-Seq

.

Nat Methods

 

2008

;

5

:

621

628

.

2.

Wang

 

Z

,

Gerstein

 

M

,

Snyder

 

M

.

RNA-Seq: a revolutionary tool for transcriptomics

.

Nat Rev Genet

 

2009

;

10

:

57

63

.

3.

Conesa

 

A

,

Madrigal

 

P

,

Tarazona

 

S

, et al.   

A survey of best practices for RNA-seq data analysis

.

Genome Biol

 

2016

;

17

:

1

19

.

4.

Anders

 

S

,

Pyl

 

PT

,

Huber

 

W

.

HTSeq—a python framework to work with high-throughput sequencing data

.

Bioinformatics

 

2015

;

31

:

166

169

.

5.

Fonseca

 

NA

,

Marioni

 

J

,

Brazma

 

A

.

RNA-Seq gene profiling—a systematic empirical comparison

.

PLoS One

 

2014

;

9

:

e107026

.

6.

Wu

 

P-Y

,

Phan

 

JH

,

Wang

 

MD

.

Assessing the impact of human genome annotation choice on RNA-seq expression estimates

.

BMC Bioinformatics

 

2013

;

14

(

Suppl 11

):

S8

S13

.

7.

Levin

 

JZ

,

Yassour

 

M

,

Adiconis

 

X

, et al.   

Comprehensive comparative analysis of strand-specific RNA sequencing methods

.

Nat Methods

 

2010

;

7

:

709

715

.

8.

Kim

 

D

,

Pertea

 

G

,

Trapnell

 

C

, et al.   

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

.

Genome Biol

 

2013

;

14

:

R36

R13

.

9.

Blankenberg

 

D

,

Gordon

 

A

,

Kuster Von

 

G

, et al.   

Manipulation of FASTQ data with galaxy

.

Bioinformatics

 

2010

;

26

:

1783

1785

.

10.

Bolger

 

AM

,

Lohse

 

M

,

Usadel

 

B

.

Trimmomatic: a flexible trimmer for illumina sequence data

.

Bioinformatics

 

2014

;

30

:

2114

2120

.

11.

Kim

 

D

,

Langmead

 

B

,

Salzberg

 

SL

.

HISAT: a fast spliced aligner with low memory requirements

.

Nat Methods

 

2015

;

12

:

357

360

.

12.

Love

 

MI

,

Huber

 

W

,

Anders

 

S

.

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

.

Genome Biol

 

2014

;

15

:

550

.

© The Author(s) 2020. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

© The Author(s) 2020. Published by Oxford University Press.

What is strand specific RNA sequencing?

RNA-Seq is a recently developed sequencing technology, that through the analysis of cDNA allows for unique insights into the transcriptome of a cell. The data generated by RNA-Seq provides information on gene expression, alternative splicing events and the presence of non-coding RNAs.

How do you analyze data in RNA sequencing?

For most RNA‐seq studies, the data analyses consist of the following key steps [5, 6]: (1) quality check and preprocessing of raw sequence reads, (2) mapping reads to a reference genome or transcriptome, (3) counting reads mapped to individual genes or transcripts, (4) identification of differential expression (DE) ...

What are RNA

RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.

What method can be used to compare the transcriptomes of individual single cells?

Introduction to Single-Cell RNA Sequencing Single-cell sequencing is a next-generation sequencing (NGS) method that examines the genomes or transcriptomes of individual cells, providing a high-resolution view of cell-to-cell variation.