正所谓前人栽树,后人乘凉。
本文参考来源:《RNA-seq数据分析实用方法》、《生物信息学计算技术和软件导论》、《新一代测序数据分析》、网上的博客及本人自己的RNA-seq分析过程。
一、数据准备
-
双端测序的fastq文件:control_1.fastq、control_2.fastq、treatment_1.fastq、treatment_2.fastq(举例)。
-
全基因组文件:genome_xxx.fa
-
gtf注释文件:genome_xxx.gtf
1.1 测序数据下载
推荐使用IBM的aspera,速度非常快。
dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503143/SRR3503143.sra /home/dingli/GSE81436/download
dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503145/SRR3503145.sra /home/dingli/GSE81436/download
dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503146/SRR3503146.sra /home/dingli/GSE81436/download
dingli@wwkong-T7910:~/GSE81436/data$ ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR350/SRR3503148/SRR3503148.sra /home/dingli/GSE81436/download
1.2 序列文件及注释下载
dingli@wwkong-T7910:~/GSE81436$ mkdir -r ref
dingli@wwkong-T7910:~/GSE81436$ cd ref
dingli@wwkong-T7910:~/GSE81436$ wget http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.con
dingli@wwkong-T7910:~/GSE81436$ wget http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.gff3
二、分析流程
2.1 建立数据分析目录
在分析之前,在当前工作目录下建立三个文件夹及相应的文件,文件结构如下:
-
index文件夹存放水稻基因组的注释文件及其相应索引文件;
-
genome文件夹存放水稻的基因组序列文件;
-
data文件夹存放水稻在两种不同条件下测序的RNA-Seq数据。
2.2 准备数据和软件
2.2.1 软件准备:
Tophat2(v2.0.12):短读段映射。
Bowtie2(v2.2.3):一个超快的短读段比对工具。
Cufflinks(v2.0.0):转录组装配。
HTseq:用于NGS数据分析的python脚本。
DESeq2:R的bioconductor包进行差异表达分析。
Samtools(v0.1.18):读段分析工具。
CummeRbund:R的bioconductor包进行差异表达可视化分析。
2.2.2 将sra文件解压缩:
dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503143.sra
dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503145.sra
dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503146.sra
dingli@wwkong-T7910:~/GSE81436/data$ fastq-dump --gzip --split-3 SRR3503148.sra
2.3 查看数据质量并进行过滤
首先查看一下fastq格式数据内容:
dingli@wwkong-T7910:~/GSE81436/data$ less SRR3503143_1.fq
2.3.1 使用fastqc软件查看数据质量:
dingli@wwkong-T7910:~/GSE81436/data$ fastqc -o qc_untrim -t 10 ./*.fastq.gz
会生成 XXX_1.fq_fastqc XXX_2.fq_fastqc等共计八个文件夹。
2.3.2 数据过滤:用Trimmomatic
dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503143_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503143_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503143_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503145_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503145_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503145_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503146_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503146_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503146_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
dingli@wwkong-T7910:~/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar PE -phred33 /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503148_1.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/SRR3503148_2.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_1_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_1_unpaired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_2_paired.fastq.gz /home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/trim_SRR3503148_2_unpaired.fastq.gz ILLUMINACLIP:/home/dingli/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
2.3.3 再次用fastqc查看数据质量
dingli@wwkong-T7910:~/GSE81436/data$ fastqc -o qc_trim -t 10 ./*paired.fastq.gz
2.4 使用Bowtie2建立参考基因组的索引
dingli@wwkong-T7910:~/GSE81436/index$ bowtie2-build rice.fa rice
输入该命令后,文件夹内会产生:xxx.1.bt2、xxx.2.bt2、xxx.3.bt2、xxx.4.bt2、xxx.rev.1.bt2、xxx.rev.2.bt2
2.5 将reads mapping到参考基因组——基于Bowtie2的Tophat2:
dingli@wwkong-T7910:~/GSE81436a$ tophat -p 8 -o ./tophat2_result/Cr-DJ-1 -G ./index/rice.gtf rice ./data/trim_SRR3503143_1_paired.fastq ./data/trim_SRR3503143_2_paired.fastq
dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/Cr-DJ-2 -G ./index/rice.gtf rice ./data/trim_SRR3503145_1_paired.fastq ./data/trim_SRR3503145_2_paired.fastq
dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/osddm1a1b-1 -G ./index/rice.gtf rice ./data/trim_SRR3503146_1_paired.fastq ./data/trim_SRR3503146_2_paired.fastq
dingli@wwkong-T7910:~/GSE81436$ tophat -p 8 -o ./tophat2_result/osddm1a1b-2 -G ./index/rice.gtf rice ./data/trim_SRR3503148_1_paired.fastq ./data/trim_SRR3503148_2_paired.fastq
执行以上命令后,会生成目录,里面会产生accepted_hits.bam文件。
参数说明
-p 8 表示使用8个线程
-o CrDJ-1表示结果文件输出到CrDJ-1文件夹中(不需要自己提前创建)
-G 后面跟基因组注释文件
2.5 差异表达分析——HTSeq-count和DESeq:分析差异表达基因的工具
2.5.1 使用HTSeq-count对accepted_hits.bam中的count进行计数:
注意:对于双端测序数据,首先要对bam文件进行排序:
(1)对bam文件进行排序:
dingli@wwkong-T7910:~/GSE81436$ samtools sort -n accepted_hits.bam -o accepted_hits.sorted.bam
(2)对count进行计数:
htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > Cr-DJ-1.count
dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > Cr-DJ-2.count
dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > osddm1a1b-1.count
dingli@wwkong-T7910:~/GSE81436$ htseq-count -f bam --stranded=no --type=exon --idattr=gene_id accepted_hits.sorted.bam rice.gtf > osddm1a1b-2.count
(3)合并各个count文件:
dingli@wwkong-T7910:~/GSE81436/htseq_result$ paste Cr-DJ-1.count Cr-DJ-2.count osddm1a1b-1.count osddm1a1b-2.count | awk '{printf $1"\t";for(i=2;i<=NF;i=i+2)printf $i"\t";print $i}' > Cr-DJ_osddm1a1b_matrix.out
2.5.2 使用DESeq2进行差异分析
setwd("/home/dingli/GSE81436/RNA-seq_Cr-DJ_osddm1a1b/DESeq_result")
countdata <- read.table("Cr-DJ_osddm1a1b_matrix.out", header=TRUE, row.names=1)
colnames(countdata) <- c("DJ-1","DJ-2","osddm1a1b-1","osddm1a1b-2")
condition <- factor(c("DJ","DJ","osddm1a1b","osddm1a1b"))
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds <- DESeq(dds)
res <- results(dds)
table(res$padj<0.01)
res_padj <- res[order(res$padj), ]
write.csv(res, file="diffexpr_padj_results.csv",row.names = T)
subset(res,padj < 0.01) -> diff
subset(diff,log2FoldChange < -1) -> down
subset(diff,log2FoldChange > 1) -> up
as.data.frame(down) -> down_gene
as.data.frame(up) -> up_gene
write.csv(up_gene, file="up_gene_LogFC1.csv",row.names = T)
write.csv(down_gene, file="down_gene_LogFC1.csv",row.names = T)
重要说明:
-
1.如果是pair-end数据,则必须按照reads名称排序后,再使用htseq-count:
samtools sort -n accepted_hits.bam -o accepted_hits.sorted.bam
-
2.htseq-count:-i参数默认为gene_id,transcript_id由于模型问题,计数效果不佳。
-
3.DESeq2需要安装很多依赖包,安装需要耐心。