2020-12-17-单细胞转录组初探系列二

单细胞优秀示例之Scater包

Posted by DL on December 17, 2020

来源:生信技能树-jimmy-B站 来源:生信技能树-jimmy-github

0.引言

教程,当然是以官网为主,不过看英文笔记有挑战,简略带领大家一起学习咯: Biconductor-Scater

值得提醒的是 2017年 11 月 这个 scater 包经过了重大变革,所以如果大家看到比较旧的教程需要注意一下,通常是无法成功的。

其GitHub的教程:hemberg-lab

1.载入必要的R包

需要自行下载安装一些必要的R包! 因为大量学员在中国大陆,通常不建议大家使用下面的R包安装方法,建议是切换镜像后再下载R包。参考:生信技能树

if (!requireNamespace("Rtsne"))
    install.packages("Rtsne")
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
if (!requireNamespace("scater"))
    BiocManager::install("scater")
if (!requireNamespace("scRNAseq"))
    BiocManager::install("scRNAseq")
if (!requireNamespace("destiny"))
    BiocManager::install("destiny")
if (!requireNamespace("SC3"))
    BiocManager::install("SC3")

加载R包

rm(list = ls()) # clear the environment
#load all the necessary libraries
options(warn=-1) # turn off warning message globally
suppressMessages(library(scater))
suppressMessages(library(scRNAseq))

2.创建测试数据集

我们选择 scRNAseq 这个R包。 这个包内置的是 Pollen et al. 2014 数据集,人类单细胞细胞,分成4类,分别是 pluripotent stem cells 分化而成的 neural progenitor cells (“NPC”) ,还有 “GW16” and “GW21” ,“GW21+3” 这种孕期细胞。大小是50.6 MB,下载需要一点点时间,先安装加载它们。

这个数据集很出名,截止2019年1月已经有近400的引用了,后面的人开发R包算法都会在其上面做测试,比如 SinQC 这篇文章就提到:We applied SinQC to a highly heterogeneous scRNA-seq dataset containing 301 cells (mixture of 11 different cell types) (Pollen et al., 2014).

不过本例子只使用了数据集的4种细胞类型而已,因为 scRNAseq 这个R包就提供了这些,完整的数据是 23730 features, 301 samples 在 https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/

这里面的表达矩阵是由 RSEM (Li and Dewey 2011) 软件根据 hg38 RefSeq transcriptome 得到的,总是130个文库,每个细胞测了两次,测序深度不一样。

library(scRNAseq)
## ----- Load Example Data -----
fluidigm <- ReprocessedFluidigmData()
fluidigm
# Set assay to RSEM estimated counts
assay(fluidigm)  <- assays(fluidigm)$rsem_counts
ct <- floor(assays(fluidigm)$rsem_counts)
ct[1:4,1:4]
##          SRR1275356 SRR1274090 SRR1275251 SRR1275287
## A1BG              0          0          0          0
## A1BG-AS1          0          0          0          0
## A1CF              0          0          0          0
## A2M               0          0          0         33
table(rowSums(ct)==0)
## 
## FALSE  TRUE 
## 17096  9159
# 这里使用原始表达矩阵,所以有很多基因在所有细胞均无表达量,即表现为没有被检测到,这样的基因是需要过滤掉的。

pheno_data <- as.data.frame(colData(fluidigm))
## 这里需要把Pollen的表达矩阵做成我们的 scater 要求的对象
#data("sc_example_counts")
#data("sc_example_cell_info") 
# 你也可以尝试该R包自带的数据集。
# 参考 https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette-intro.R
sce <- SingleCellExperiment(
    assays = list(counts = ct), 
    colData = pheno_data
    )
sce
## class: SingleCellExperiment 
## dim: 26255 130 
## metadata(0):
## assays(1): counts
## rownames(26255): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
## colData names(28): NREADS NALIGNED ... Cluster1 Cluster2
## reducedDimNames(0):
## spikeNames(0):
# 后面所有的分析都是基于 sce 这个变量
# 是一个 SingleCellExperiment 对象,被很多单细胞R包采用。

3.一些质量控制

这个也非常复杂,推荐看原文说明书: https://www.bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html

这里稍作讲解,包括基因层面及细胞层面,但是并不设计实验本身引入的干扰因素。

3.1 首先看基因层面的过滤

使用 calculateQCMetrics 函数作用于 sce 那个单细胞数据对象后,就可以 用 rowData(object) 可以查看各个基因统计指标:

  • mean_counts: 平均表达量
  • log10_mean_counts: 归一化 log10-scale.
  • pct_dropout_by_counts: 该基因丢失率。
  • n_cells_by_counts: 多少个细胞表达了该基因

上面那些指标可以用来过滤,也可以自己重新再次计算一下那些统计学指标。

主要是过滤低表达量基因,还有 线粒体基因 和 ERCC spike-ins 的控制。

exprs(sce) <- log2(
    calculateCPM(sce ) + 1)
## 只有运行了下面的函数后才有各式各样的过滤指标
genes <- rownames(rowData(sce))
genes[grepl('^MT-',genes)]
## character(0)
genes[grepl('^ERCC-',genes)]
## character(0)
# 比较不幸的是,这个测试数据里面既没有线粒体基因,又没有ERCC序列。
sce <- calculateQCMetrics(sce, 
        feature_controls = list(ERCC = grep('^ERCC',genes)))
# 也没有定义啥细胞是属于control组,所以这里并不需要完全follow教程
# example_sce <- perCellQCMetrics(example_sce, 
#     feature_controls = list(ERCC = 1:20, mito = 500:1000),
#     cell_controls = list(empty = 1:5, damaged = 31:40))

查看信息

tmp <- as.data.frame(rowData(sce))
colnames(tmp)
## [1] "is_feature_control"      "is_feature_control_ERCC"
## [3] "mean_counts"             "log10_mean_counts"      
## [5] "n_cells_by_counts"       "pct_dropout_by_counts"  
## [7] "total_counts"            "log10_total_counts"
head(tmp)
##          is_feature_control is_feature_control_ERCC  mean_counts
## A1BG                  FALSE                   FALSE 1.553846e+00
## A1BG-AS1              FALSE                   FALSE 1.015385e+00
## A1CF                  FALSE                   FALSE 7.692308e-03
## A2M                   FALSE                   FALSE 1.505154e+02
## A2M-AS1               FALSE                   FALSE 1.407692e+00
## A2ML1                 FALSE                   FALSE 1.538462e-01
##          log10_mean_counts n_cells_by_counts pct_dropout_by_counts
## A1BG           0.407194731                10              92.30769
## A1BG-AS1       0.304357939                 2              98.46154
## A1CF           0.003327943                 1              99.23077
## A2M            2.180456733                21              83.84615
## A2M-AS1        0.381600985                 3              97.69231
## A2ML1          0.062147907                 9              93.07692
##          total_counts log10_total_counts
## A1BG              202           2.307496
## A1BG-AS1          132           2.123852
## A1CF                1           0.301030
## A2M             19567           4.291546
## A2M-AS1           183           2.264818
## A2ML1              20           1.322219

目前只过滤那些在所有细胞都没有表达的基因, 但是这个过滤条件可以自行调整。可以看到基因数量大幅度减少。

keep_feature <- rowSums(exprs(sce) > 0) > 0
table(keep_feature)
## keep_feature
## FALSE  TRUE 
##  9159 17096
sce <- sce[keep_feature,]
sce
## class: SingleCellExperiment 
## dim: 17096 130 
## metadata(0):
## assays(2): counts logcounts
## rownames(17096): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(8): is_feature_control is_feature_control_ERCC ...
##   total_counts log10_total_counts
## colnames(130): SRR1275356 SRR1274090 ... SRR1275366 SRR1275261
## colData names(64): NREADS NALIGNED ...
##   pct_counts_in_top_200_features_ERCC
##   pct_counts_in_top_500_features_ERCC
## reducedDimNames(0):
## spikeNames(0):

3.2 然后看细胞层面的过滤

用 colData(object) 可以查看各个样本统计情况

  • total_counts: total number of counts for the cell (aka ‘library size’)

  • log10_total_counts: total_counts on the log10-scale

  • total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)

等等

每个数据集都有适合自己的过滤阈值,不要局限于教程

tmp <- as.data.frame(colData(sce))
colnames(tmp)
##  [1] "NREADS"                                        
##  [2] "NALIGNED"                                      
##  [3] "RALIGN"                                        
##  [4] "TOTAL_DUP"                                     
##  [5] "PRIMER"                                        
##  [6] "INSERT_SZ"                                     
##  [7] "INSERT_SZ_STD"                                 
##  [8] "COMPLEXITY"                                    
##  [9] "NDUPR"                                         
## [10] "PCT_RIBOSOMAL_BASES"                           
## [11] "PCT_CODING_BASES"                              
## [12] "PCT_UTR_BASES"                                 
## [13] "PCT_INTRONIC_BASES"                            
## [14] "PCT_INTERGENIC_BASES"                          
## [15] "PCT_MRNA_BASES"                                
## [16] "MEDIAN_CV_COVERAGE"                            
## [17] "MEDIAN_5PRIME_BIAS"                            
## [18] "MEDIAN_3PRIME_BIAS"                            
## [19] "MEDIAN_5PRIME_TO_3PRIME_BIAS"                  
## [20] "sample_id.x"                                   
## [21] "Lane_ID"                                       
## [22] "LibraryName"                                   
## [23] "avgLength"                                     
## [24] "spots"                                         
## [25] "Biological_Condition"                          
## [26] "Coverage_Type"                                 
## [27] "Cluster1"                                      
## [28] "Cluster2"                                      
## [29] "is_cell_control"                               
## [30] "total_features_by_counts"                      
## [31] "log10_total_features_by_counts"                
## [32] "total_counts"                                  
## [33] "log10_total_counts"                            
## [34] "pct_counts_in_top_50_features"                 
## [35] "pct_counts_in_top_100_features"                
## [36] "pct_counts_in_top_200_features"                
## [37] "pct_counts_in_top_500_features"                
## [38] "total_features_by_counts_endogenous"           
## [39] "log10_total_features_by_counts_endogenous"     
## [40] "total_counts_endogenous"                       
## [41] "log10_total_counts_endogenous"                 
## [42] "pct_counts_endogenous"                         
## [43] "pct_counts_in_top_50_features_endogenous"      
## [44] "pct_counts_in_top_100_features_endogenous"     
## [45] "pct_counts_in_top_200_features_endogenous"     
## [46] "pct_counts_in_top_500_features_endogenous"     
## [47] "total_features_by_counts_feature_control"      
## [48] "log10_total_features_by_counts_feature_control"
## [49] "total_counts_feature_control"                  
## [50] "log10_total_counts_feature_control"            
## [51] "pct_counts_feature_control"                    
## [52] "pct_counts_in_top_50_features_feature_control" 
## [53] "pct_counts_in_top_100_features_feature_control"
## [54] "pct_counts_in_top_200_features_feature_control"
## [55] "pct_counts_in_top_500_features_feature_control"
## [56] "total_features_by_counts_ERCC"                 
## [57] "log10_total_features_by_counts_ERCC"           
## [58] "total_counts_ERCC"                             
## [59] "log10_total_counts_ERCC"                       
## [60] "pct_counts_ERCC"                               
## [61] "pct_counts_in_top_50_features_ERCC"            
## [62] "pct_counts_in_top_100_features_ERCC"           
## [63] "pct_counts_in_top_200_features_ERCC"           
## [64] "pct_counts_in_top_500_features_ERCC"
# 可以发现细胞质控属性非常多,可以说是实用至极。
## 比如看每个样本测到的基因数量
tf <- sce$total_features_by_counts 
boxplot(tf)

r8GblD.png

fivenum(tf)
## [1] 1418.0 2961.0 3841.5 5381.0 8221.0

还有很多其它指标,比如哪些细胞的ERCC含量过高,或者某些其高表达量基因占比太高,等等。


4.一些可视化

可视化函数非常多,所以这个R包才会被引用那么广泛,里面包括如下:

首先是一些细胞距离情况

  • plotPCA: produce a principal components plot for the cells.
  • plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
  • plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
  • plotMDS: produce a multi-dimensional scaling plot for the cells.
  • plotReducedDim: plot a reduced-dimension representation of the cells.

然后是一些表达量相关的:

  • plotExpression: plot expression levels for a defined set of features.
  • plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
  • plotColData: plot cell metadata and QC metrics.
  • plotRowData: plot feature metadata and QC metrics.

4.1 首先可视化表达量

可以添加一些细胞属性,展示如下:

# 挑选一些细胞属性(临床信息) 来进行可视化展示。
colnames(tmp)[25:28]
## [1] "Biological_Condition" "Coverage_Type"        "Cluster1"            
## [4] "Cluster2"
# 主要展现下面这些基因
rownames(sce)[1:6]
## [1] "A1BG"     "A1BG-AS1" "A1CF"     "A2M"      "A2M-AS1"  "A2ML1"

展示一些基因在不同细胞分类的表达区别

## --------------------
plotExpression(sce, rownames(sce)[1:6],
               x = "Biological_Condition", 
               exprs_values = "logcounts") 

r8JE0s.png

散点图,展示两个基因的相关性,这里批量展示6对相关性

## -------------------------------------------
plotExpression(sce, rownames(sce)[1:6],
               x = rownames(sce)[1])

r8Jekq.png

还可以更复杂

## ---------------
plotExpression(sce, rownames(sce)[1:6],
               colour_by = "Biological_Condition", 
               shape_by = "Cluster1", 
               size_by = rownames(sce)[1])

r8JKpT.png


4.2 还可以可视化细胞距离分布

这里可以学习PCA或者tSNE的使用

# 举例
sce <- runPCA(sce)
# 这里并没有进行任何基因的挑选,就直接进行了PCA,与 seurat包不一样。
reducedDimNames(sce)
## [1] "PCA"

PCA分布图上面添加临床信息,同样的可以看到GW16混在了GW21里面

## --------------
plotReducedDim(sce, use_dimred = "PCA", 
               colour_by = "Biological_Condition" )

r8Ja9K.png

PCA分布图上面添加表达量信息, 但这个基本上不怎么用

## ------------
plotReducedDim(sce, use_dimred = "PCA", 
               colour_by = rownames(sce)[1], 
               size_by = rownames(sce)[2])

r8JcNt.png

最原始的

## --------------
plotPCA(sce)

r8JHNq.png


4.3 其它降维算法结果的可视化

## ----plot-pca-feature-controls---------------------------------------------
# 这里在真实场景中应用比较多。
sce2 <- runPCA(sce, 
               feature_set = rowData(sce)$is_feature_control)
plotPCA(sce2)

r8JzDJ.png

仅仅是选取前20个PC, 可以看到绘图时候细胞分布并没有太大区别

## ----plot-pca-4comp-colby-shapeby------------------------------------------
sce <- runPCA(sce, ncomponents=20)
plotPCA(sce,  colour_by = "Biological_Condition" )

r8Y328.png

但是可以挑选指定的PC来可视化, 这里选择的是第4个PC

plotPCA(sce, ncomponents = 4,  
        colour_by = "Biological_Condition" )

r8YD2T.png

意义不大的展示

## ----plot-pca-4comp-colby-sizeby-exprs-------------------------------------
plotPCA(sce,
        colour_by = rownames(sce)[1], 
        size_by = rownames(sce)[2])

r8YcqJ.png

tSNE可视化

## ----plot-tsne-1comp-colby-sizeby-exprs------------------------------------
# Perplexity of 10 just chosen here arbitrarily. 
set.seed(1000)
# 这里的这个 perplexity 参数很重要
sce <- runTSNE(sce, perplexity=30)
plotTSNE(sce, 
         colour_by = "Biological_Condition" )

r8YhPx.png

## ----plot-tsne-from-pca----------------------------------------------------
set.seed(1000)
sce <- runTSNE(sce, perplexity=10, use_dimred="PCA", n_dimred=10)
plotTSNE(sce,  
         colour_by = "Biological_Condition" )

r8tkon.png

这一步会调用destiny

## ----plot-difmap-1comp-colby-sizeby-exprs----------------------------------
sce <- runDiffusionMap(sce)
plotDiffusionMap(sce,  
                 colour_by = "Biological_Condition" )

r8tZWV.png


5.关于SC3包

虽然 SC3包 跟 scater 联系很紧密,但毕竟不属于其知识点,这里就简单运行即可。

预估亚群, 这里显示有24个, 比较多

library(SC3) # BiocManager::install('SC3')
sce <- sc3_estimate_k(sce)
metadata(sce)$sc3$k_estimation
## [1] 24
rowData(sce)$feature_symbol <- rownames(rowData(sce))

一步运行sc3的所有分析, 相当耗费时间

这里kn表示的预估聚类数, 考虑到数据集是已知的,我们强行设置为4组, 具体数据要具体考虑。

# 耗费时间
kn <- 4 ## 这里可以选择 3:5 看多种分类结果。
sc3_cluster <- "sc3_4_clusters"
Sys.time()
## [1] "2019-03-11 17:44:18 CST"
sce <- sc3(sce, ks = kn, biology = TRUE)
Sys.time()
## [1] "2019-03-11 17:45:06 CST"

5.1 可视化展示部分, kn就是聚类数

热图: 比较先验分类和SC3的聚类的一致性

sc3_plot_consensus(sce, k = kn, show_pdata = c("Biological_Condition",sc3_cluster))

r8tIkn.png

展示表达量信息

sc3_plot_expression(sce, k = kn, show_pdata =  c("Biological_Condition",sc3_cluster))

r8tLXF.png

展示可能的标记基因

sc3_plot_markers(sce, k = kn, show_pdata =  c("Biological_Condition",sc3_cluster))

r8tzkR.png

在PCA上展示SC3的聚类结果

plotPCA(sce, colour_by =  sc3_cluster )

r8Ni6O.png

# sc3_interactive(sce)