2020-10-09-RNA-seq转录组数据分析入门

入门知识及实战pipline

Posted by DL on October 9, 2020

一、RNA-seq分析的介绍

1. Recent Developments

1.1 Advances in RNA-seq technologies

BH7z4A.png

1.2 Different Pipelines

BH7ypq.png

1.3 General Workflow

BH7SyT.png


2. Prerequisite knowledge

2.1 Data Format

  • (1) FASTA

BH75N9.png

  • (2) FASTQ

BH7OBD.png

  • (3) Phred quality scores

BHHVEQ.png

  • (4) ASCII table

BHHMvV.png

BHHdv6.png

  • (5) Quality scores

BHb9IJ.png

  • (6) GTF & GFF

BHb1zt.png


3. Data Sources

BHb2o4.png

  • (1) GEO (Gene Expression Omnibus)

BHbhWR.png

  • (2) Get FASTQ file from GEO

BHbqTe.png


4. Computing Environment

BHqAYj.png


二、准备工作

2.1 安装conda及相关软件

conda create -n scRNA-seq python=2
conda info --env
source activate scRNA-seq
conda install -y sra-tools cutadapt multiqc trim-galore subread hisat2

2.2 下载sra数据

#激活conda的scRNA-seq环境
conda activate scRNA-seq
cd /mnt/e/scRNA

#下载sra数据
mkdir raw_sra
wget -i download.txt

#检查数据完整性:md5值
#给文件生成md5值
md5sum *gz > md5.txt

#比对已有的md5值
md5sum -c md5.txt

2.3 sra文件转fastq文件

mkdir raw_fq
ls raw_sra/* | while read id; do (fastq-dump --gzip --split-e $id); done

三、数据质控

#查看qc前结果
for i in 'ls *gz';do fastqc $i;done
#或者
ls *gz | xargs -I [] echo 'nohup fastqc [] &'

#使用multiqc合并qc结果
multiqc ./

#用Trimmomatic过滤低质量序列
#注:接头序列的选择
#"Illumina Single End" / "Illumina Paired End": "TruSeq2-SE.fa" and "TruSeq3-PE.fa"
#"TruSeq Universal Adapter" / "TruSeq Adapter,Index...": "TruSeq3-SE.fa" and "TruSeq3-PE.fa"
#注:去接头参数的选择:true(双端测序选此) / false

#代码较长,如下图所示:

0RexW8.png

#查看qc后结果
for i in 'ls *gz';do fastqc $i;done

四、序列比对

(1)基于基因组比对(以染色体为单位):

  • STAR
  • Hisat2

(2)基于转录本比对(以转录本为单位):

  • RSEM

4.1 STAR操作实例

(1)建立索引:

STAR --runThreadN 6 --runMode genomeGenerate \
--genomeDir arab_STAR_genome \
--genomeFastaFiles 00ref/TAIR10_Chr.all.fasta \
--sjdbGTFfile 00ref/Araport11_GFF3_genes_transposons.201606.gtf \
--sjdbOverhang 149 #预测可变剪接

(2)序列比对:

STAR --runThreadN 5 --genomeDir arab_STAR_genome \
--readFilesCommand zcat \
--readFilesIn 02clean_data/sample1_paired_clean_R1.fastq.gz \
02clean_data/sample1_paired_clean_R2.fastq.gz \
--outFileNamePrefix 03align_out/sample1_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts

4.2 Hisat2操作实例

(1)建立index:

mkdir reference
hisat2_extract_exons.py GRCm38.gtf  > exons.txt
hisat2_extract_splice_sites.py GRCm38.gtf > ss.txt
hisat2-build -p 8 --ss ss.txt --exon exons.txt GRCm38.fa GRCm38

(2)将fq文件比对至参考基因组:

index=/mnt/e/scRNA/reference/GRCm38
ls raw_fq/*gz | while read id; do (hisat2 -p 8 -x $index -U $id -S ${id%%.*}.hisat.sam); done

#sam格式转bam格式
mkdir align
mv raw_fq/*sam align/
ls *.sam | while read id; do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done

#为bam文件建立索引
ls *.bam | xargs -i samtools index {}

五、计算表达量

(1)STAR+RSEM:

  • 输出结果可以选择转录本定量或者基因定量
  • 定量单位包括feature count、FPKM、TPM
  • 操作相对复杂

(2)STAR+HTSeq:

  • 输出结果为原始read count
  • 结果可用于差异表达分析
  • 操作相对简单

(3)STAR+featureCounts:

  • STAR+featureCounts = STAR+HTSeq升级版
  • 安装:conda install subread
  • 优点:快,非常快

(4)Kallisto (free-alignment):

  • 速度快省内存
  • 基于转录本定量
  • 不产生bam文件不方便其他后续分析

5.1 STAR+RSEM演示

  • 准备定量分析所需文件
  • 利用STAR结果进行定量分析
# rsem prepare reference
rsem-prepare-reference --gtf 00ref/Araport11_GFF3_genes_transposons.201606.gtf \
00ref/TAIR10_Chr.all.fasta \
arab_RSEM/arab_rsem

rsem-calculate-expression --paired-end --no-bam-output \
--alignments -p 5 \
-q 03align_out/sample2_AlignedtoTranscriptome.out.bam \
arab_RSEM/arab_rsem \
04rsem_out/sample2_rsem

# htseq-count
htseq-count -r pos -m union -f bam -s no \
-q 03align_out/sample2Aligned.sortedByCoord.out.bam >
05htseq_out/sample2.htseq.out

5.2 Kallisto演示

  • 利用转录本参考序列文件构建索引
  • 进行无比对定量分析
# build index
kallisto index -i arab_kallisto ../arab_RSEM/arab_rsem.transcripts.fa

# count
kallisto quant -i arab_kallisto/arab_kallisto -o 05kallisto_out/sample2 \
02clean_data/sample2_paired_clean_R1.fastq.gz \
02clean_data/sample2_paired_clean_R2.fastq.gz

5.3 featureCounts演示

# 使用featureCounts进行定量
featureCounts -p -a ../00ref/Araport11_GFF3_genes_transposons.201606.gtf \
-o out_counts.txt -T 6 -t exon -g gene_id sample*_Aligned.sortedByCoord.out.bam

六、差异分析

6.1 表达定量结果转换为表达矩阵

  • RSEM自带脚本
  • 去除所有样本表达量全部为0的基因
mkdir 06deseq_out

# 构建表达矩阵
rsem-generate-data-matrix *_rsem.genes.results > output.matrix

# 删除未检测到表达的基因
awk 'BEGIN{printf "geneid\ta1\ta2\tb1\tb2\n"}{if($2+$3+$4+$5>0)print $0}' output.matrix > deseq2_input.txt


6.2 使用DESeq2进行差异表达分析

接下来在R-studio中操作:

# 读取文件
input_data <- read.table("deseq2_input.txt",header=TRUE,row.names=1)

# 取整
input_data <- round(input_data,digits=0)

# 准备工作
input_data <- as.matrix(input_data)
condition <- factor(c(rep("ctl",2),rep("exp",2)))
coldata <- data.frame(row.names=colnames(input_data),condition)
library(DESeq2)

# 构建deseq输入矩阵
dds <- DESeqDataSetFromMatrix(countData=input_data,colData=coldata,design=~condition)
# DESeq2进行差异分析
dds <- DESeq(dds)
# 提取结果
res <- results(dds)
summary(res)
res <- res[order(res$padj),]
resdata <- merge(as.data.frame(res),as.data.frame(counts(dds,normalized=TRUE)),by="row.names",sort=FALSE)
names(resdata)[1] <- "Gene"

6.3 可视化展示

# 可视化展示
maplot <- function(res,thresh=0.05,labelsig=TRUE, ...){
	with(res,plot(baseMean,log2FoldChange,pch=20,cex=0.5,log="x",...))
	with(subset(res,padj(thresh),points(baseMean,log2FoldChange,col="red",pch=20,cex=1.5)))
}
png("diffexor-maplot.png",1500,1000,pointsize=20)
maplot(resdata,main="MA Plot")
dev.off()


install.packages("ggrepel")
library(ggplot2)
library(ggrepel)

resdata$significant <- as.factor(resdata$padj<0.05 & abs(resdata$log2FoldChange)>1)
ggplot(data=resdata,aes(x=log2FoldChange,y=-log10(padj),color=significant))+geom_point()+ylim(0,8)+scale_color_manual(values=c("black","red"))+geom_hline(yintercept=-log10(0.05),lty=4,lwd=0.6,alpha=0.8)+
geom_vline(xintercept=c(1,-1),lty=4,lwd=0.6,alpha=0.8)+theme_bw()+theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank(),axis.line=element_line(colour="black"))+itle="Volcanoplot",x="log2 (fold change)",y="-log10 (padj)")+plot.title=element_text(hjust=0.5)+ext_repel(data=subset(resdata,-log10(padj)>6),aes(label=Gene),col="black",alpha=0.8)


6.4 筛选上下调基因

# 筛选上下调基因
wc -l diffexpr-results.txt

awk '{if ($3>1 && $7<0.05) print $0}' diffexpr-results.txt | cut -f 1,2,4,7 > up.gene.txt

awk '{if ($3<-1 && $7<0.05) print $0}' diffexpr-results.txt | cut -f 1,2,4,7 > down.gene.txt


七、后续的富集分析

BHLdvq.png

7.1 Gene ontology

  Gene ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.  The ontology covers three domains: cellular component, molecular function and biological process.

BHL2G9.png

7.2 KEGG & PANTHER

BHLvsP.png