2018-12-30-RNA-seq分析流程整理

水稻GSE81436的RNA-seq分析流程(Cr-DJ_osddm1a1b)

Posted by DL on December 30, 2018

正所谓前人栽树,后人乘凉。

本文参考来源:《RNA-seq数据分析实用方法》、《生物信息学计算技术和软件导论》、《新一代测序数据分析》、网上的博客及本人自己的RNA-seq分析过程。

一、数据准备

  • 双端测序的fastq文件:control_1.fastq、control_2.fastq、treatment_1.fastq、treatment_2.fastq(举例)。

  • 全基因组文件:genome_xxx.fa

  • gtf注释文件:genome_xxx.gtf

1.1 测序数据下载

推荐使用IBM的aspera,速度非常快。

dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503143/SRR3503143.sra /home/dingli/GSE81436/download

dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503145/SRR3503145.sra /home/dingli/GSE81436/download

dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503146/SRR3503146.sra /home/dingli/GSE81436/download

dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503148/SRR3503148.sra /home/dingli/GSE81436/download

1.2 序列文件及注释下载

dingli@wwkong-T7910:~/GSE81436$ mkdir -r ref

dingli@wwkong-T7910:~/GSE81436$ cd ref

dingli@wwkong-T7910:~/GSE81436$ wget http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.con

dingli@wwkong-T7910:~/GSE81436$ wget http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.gff3

二、分析流程

2.1 建立数据分析目录

在分析之前,在当前工作目录下建立三个文件夹及相应的文件,文件结构如下:

  • index文件夹存放水稻基因组的注释文件及其相应索引文件;

  • genome文件夹存放水稻的基因组序列文件;

  • data文件夹存放水稻在两种不同条件下测序的RNA-Seq数据。


2.2 准备数据和软件

2.2.1 软件准备:

Tophat2(v2.0.12):短读段映射。

Bowtie2(v2.2.3):一个超快的短读段比对工具。

Cufflinks(v2.0.0):转录组装配。

HTseq:用于NGS数据分析的python脚本。

DESeq2:R的bioconductor包进行差异表达分析。

Samtools(v0.1.18):读段分析工具。

CummeRbund:R的bioconductor包进行差异表达可视化分析。

2.2.2 将sra文件解压缩:

dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503143.sra

dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503145.sra 

dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503146.sra 

dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503148.sra 

2.3 查看数据质量并进行过滤

首先查看一下fastq格式数据内容:

dingli@wwkong-T7910:~/GSE81436/data$ less SRR3503143_1.fq

2.3.1 使用fastqc软件查看数据质量:

dingli@wwkong-T7910:~/GSE81436/data$ fastqc -o qc_untrim -t 10 ./*.fastq.gz

会生成 XXX_1.fq_fastqc XXX_2.fq_fastqc等共计八个文件夹。

2.3.2 数据过滤:用Trimmomatic

dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503143_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503143_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503145_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503145_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503146_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503146_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503148_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503148_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

2.3.3 再次用fastqc查看数据质量

dingli@wwkong-T7910:~/GSE81436/data$ fastqc -o qc_trim -t 10 ./*paired.fastq.gz

2.4 使用Bowtie2建立参考基因组的索引

dingli@wwkong-T7910:~/GSE81436/index$ bowtie2-build rice.fa rice

输入该命令后,文件夹内会产生:xxx.1.bt2、xxx.2.bt2、xxx.3.bt2、xxx.4.bt2、xxx.rev.1.bt2、xxx.rev.2.bt2


2.5 将reads mapping到参考基因组——基于Bowtie2的Tophat2:

dingli@wwkong-T7910:~/GSE81436a$ tophat -p 8 -o ./tophat2_result/Cr-DJ-1 -G ./index/rice.gtf rice ./data/trim_SRR3503143_1_paired.fastq ./data/trim_SRR3503143_2_paired.fastq

dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/Cr-DJ-2 -G ./index/rice.gtf rice ./data/trim_SRR3503145_1_paired.fastq ./data/trim_SRR3503145_2_paired.fastq

dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/osddm1a1b-1 -G ./index/rice.gtf rice ./data/trim_SRR3503146_1_paired.fastq ./data/trim_SRR3503146_2_paired.fastq

dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/osddm1a1b-2 -G ./index/rice.gtf rice ./data/trim_SRR3503148_1_paired.fastq ./data/trim_SRR3503148_2_paired.fastq

执行以上命令后,会生成目录,里面会产生accepted_hits.bam文件。

参数说明

-p 8 表示使用8个线程

-o CrDJ-1表示结果文件输出到CrDJ-1文件夹中(不需要自己提前创建)

-G 后面跟基因组注释文件

2.5 差异表达分析——HTSeq-count和DESeq:分析差异表达基因的工具

2.5.1 使用HTSeq-count对accepted_hits.bam中的count进行计数:

注意:对于双端测序数据,首先要对bam文件进行排序:

(1)对bam文件进行排序:

dingli@wwkong-T7910:~/GSE81436$ samtools sort -n accepted_hits.bam -o accepted_hits.sorted.bam

(2)对count进行计数:

htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > Cr-DJ-1.count

dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > Cr-DJ-2.count

dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > osddm1a1b-1.count

dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > osddm1a1b-2.count

(3)合并各个count文件:

dingli@wwkong-T7910:~/GSE81436/htseq_result$ paste Cr-DJ-1.count Cr-DJ-2.count osddm1a1b-1.count osddm1a1b-2.count | awk '{printf $1"\t";for(i=2;i<=NF;i=i+2)printf $i"\t";print $i}' > Cr-DJ_osddm1a1b_matrix.out

2.5.2 使用DESeq2进行差异分析

setwd("/home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/DESeq_result")
countdata <- read.table("Cr-DJ_osddm1a1b_matrix.out", header=TRUE, row.names=1)
colnames(countdata) <- c("DJ-1","DJ-2","osddm1a1b-1","osddm1a1b-2")
condition <- factor(c("DJ","DJ","osddm1a1b","osddm1a1b"))
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds <- DESeq(dds)
res <- results(dds)
table(res$padj<0.01)
res_padj <- res[order(res$padj), ]
write.csv(res, file="diffexpr_padj_results.csv",row.names = T)

subset(res,padj < 0.01) -> diff
subset(diff,log2FoldChange < -1) -> down
subset(diff,log2FoldChange > 1) -> up

as.data.frame(down) -> down_gene
as.data.frame(up) -> up_gene
write.csv(up_gene, file="up_gene_LogFC1.csv",row.names = T)
write.csv(down_gene, file="down_gene_LogFC1.csv",row.names = T)

重要说明:

  • 1.如果是pair-end数据,则必须按照reads名称排序后,再使用htseq-count:

    samtools sort -n accepted_hits.bam -o accepted_hits.sorted.bam

  • 2.htseq-count:-i参数默认为gene_id,transcript_id由于模型问题,计数效果不佳。

  • 3.DESeq2需要安装很多依赖包,安装需要耐心。