Overview of scRNAseq analysis

scRNAseq case study form the OSCA book

Last updated on Feb 13, 2021 9 min read R

OSCA book

Un dimanche après-midi à lÎle de la Grande Jatte (Seurat)

Motivation
Data loading
QC
Normalization
Variance modelling/feature selection
Dimensionality reduction
Clustering
Interpretation

Motivation

The aim of this post is to show an overview of the common framework for analyzing Single-Cell RNA-seq (scRNAseq) data. Before diving in the data analysis, let’s look at the library prep on a 10X Genomics machine for scRNAseq:

Dr. Eric Chow gives a fantastic overview of how cells and barcodes are sorted with the Droseq method (min 7:08).

Once the data is generated the sequential steps to handle scRNAseq are summarized in this carton from Bioconductor site:

The central object of the pipeline is SingleCellExperiment, which holds all the raw data and experiment information, and it looks like this:

The OSCA book is an amazing resource to fully understand the math behind scRNAseq analysis workflow. It has 18 cases studies of scRNAseq.

The main steps followed in scRNAseq analysis are the following:

Data Loading
QC
Normalization
Variance modeling
Dimensionality reduction
Clustering
Interpretation

Let’s dive in each step to understand what happens at each stage.

Data loading

Here, I am going to use an example of 10X Genomics data the counts matrix and associated metadata (cell barcodes, data path, etc.). This is based on Chapter 27 in OSCA book. The DropletTestFiles package contains files that are the raw output of pipelines like 10X Genomics’ CellRanger software suite, that can be then imported via the DropletUtils package’s read10xCounts() function.

library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

We need to make sure that we convert the easy interpretable gene symbol to a standard identifier that is guaranteed to be unique and valid (e.g., Ensembl).

library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

In addition to this, the scRNAseq package provides convenient access to several publicly available data sets in the form of SingleCellExperiment object.

library(scRNAseq)
out <- listDatasets()
dim(out)

## [1] 46  5

head(out)

## DataFrame with 6 rows and 5 columns
##                Reference  Taxonomy                 Part    Number
##              <character> <integer>          <character> <integer>
## 1 @aztekin2019identifi..      8355                 tail     13199
## 2 @bach2017differentia..     10090        mammary gland     25806
## 3   @baron2016singlecell      9606             pancreas      8569
## 4   @baron2016singlecell     10090             pancreas      1886
## 5 @buettner2015computa..     10090 embryonic stem cells       288
## 6 @campbell2017molecular     10090                brain     21086
##                     Call
##              <character>
## 1      AztekinTailData()
## 2      BachMammaryData()
## 3 BaronPancreasData('h..
## 4 BaronPancreasData('m..
## 5      BuettnerESCData()
## 6    CampbellBrainData()

QC

An unique aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool. emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. emptyDrops() uses Monte Carlo simulations to compute p-values for the multinomial sampling transcripts from the ambient pool.

set.seed(100) #seed forreproducible results, emptyDrops performas a simulation.
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)] #we exclude dropplets that have low UMI counts

These are the unfiltered, non-empty drops

unfiltered <- sce.pbmc

For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package. The sum column contains the total count for each cell and the detected column contains the number of detected genes. Mitochondrial counts are informative, because small mitochondrial percentages, large spike-in percentages and small library sizes are likely to be stripped nuclei, i.e., they have been so extensively damaged that they have lost all cytoplasmic content, therefore are low quality cells. Here we use a relaxed QC strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. This reduces the risk of removing cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types.

stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

summary(high.mito)

##    Mode   FALSE    TRUE 
## logical    3985     315

colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- high.mito

gridExtra::grid.arrange(
    plotColData(unfiltered, y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
    ncol=2
)

Distribution of various QC metrics in the PBMC dataset after cell calling. Each point is a cell and is colored according to whether it was discarded by the mitochondrial filter.

Figure 1: Distribution of various QC metrics in the PBMC dataset after cell calling. Each point is a cell and is colored according to whether it was discarded by the mitochondrial filter.

plotColData(unfiltered, x="sum", y="subsets_Mito_percent",
    colour_by="discard") + scale_x_log10()

Figure 2: Proportion of mitochondrial reads in each cell of the PBMC dataset compared to its total count.

Normalization

There are a number of normalization methods. We use a pre-clustering step with quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization so we need to set the random seed (via set.seed()) for reproducibility. computeSumFactors performs a scaling normalization of single-cell RNA-seq data by deconvolving size factors from cell pools. Once we have computed the size factors, we use the logNormCounts() function from scater to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript with the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called "logcounts". (Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.) These log-values will be the basis of our downstream analyses.

library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

summary(sizeFactors(sce.pbmc))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00749  0.71207  0.87490  1.00000  1.09900 12.25412

We can plot The “library size factor” for each cell. As for bulk RNAseq, library size normalization is the simplest strategy for performing scaling normalization. However, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster.

plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16,
    xlab="Library size factors", ylab="Deconvolution factors", log="xy")

Figure 3: Relationship between the library size factors and the deconvolution size factors in the PBMC dataset.

Variance modelling/feature selection

Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs). UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts with the modelGeneVarByPoisson function. We can then select the top 10% of genes with the highest biological components.

set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
    xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)

Per-gene variance as a function of the mean for the log-expression values in the PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts.

Figure 4: Per-gene variance as a function of the mean for the log-expression values in the PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts.

Dimensionality reduction

Now we want to compare cells based on the values of gene expression. We can use denoisePCA function, that “de-noises” log-expression data by removing principal components corresponding to technical noise.

set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)
set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA") 
set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

We verify how many PCs were retained.

ncol(reducedDim(sce.pbmc, "PCA"))

## [1] 9

Clustering

We can now create nearest-neighbor graphs with the buildSNNGraph function and cluster_walktrap function. cluster_walktrap will assign membership of densely connected subgraphs, also called “communities”.

g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)

table(colLabels(sce.pbmc))

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
## 205 508 541  56 374 125  46 432 302 867  47 155 166  61  84  16

plotTSNE(sce.pbmc, colour_by="label")

Figure 5: Obligatory \(t\)-SNE plot of the PBMC dataset, where each point represents a cell and is colored according to the assigned cluster.

Interpretation

To interpret the clustering results, we need to identify the genes that drive separation between clusters. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways. We perform pairwise t-tests between clusters for each gene using the findMarkers function, which uses a Welch t-test to perform DE testing between clusters and returns a list of DataFrames containing ranked candidate markers for each cluster.

markers <- findMarkers(sce.pbmc, pval.type="some", direction="up")

Let’s say that we want to examine the markers for cluster 8 in more detail.

marker.set <- markers[["8"]]
as.data.frame(marker.set[1:30,1:3])

##                     p.value           FDR summary.logFC
## CSTA          7.170624e-222 2.015964e-217     2.4178954
## MNDA          1.196631e-221 2.015964e-217     2.6614935
## FCN1          2.375980e-213 2.668543e-209     2.6380934
## S100A12       4.393470e-212 3.700839e-208     3.0808902
## VCAN          1.711043e-199 1.153038e-195     2.2603760
## TYMP          1.173532e-154 6.590164e-151     2.0237930
## AIF1          3.673649e-149 1.768285e-145     2.4603604
## LGALS2        4.004740e-137 1.686696e-133     1.8927606
## MS4A6A        5.639909e-134 2.111457e-130     1.5457061
## FGL2          2.044513e-124 6.888781e-121     1.3859366
## RP11-1143G9.4 6.891551e-122 2.110945e-118     2.8042347
## AP1S2         1.786019e-112 5.014842e-109     1.7703547
## CD14          1.195352e-110 3.098169e-107     1.4259764
## CFD           6.870490e-109 1.653531e-105     1.3560255
## GPX1          9.048825e-107 2.032607e-103     2.4013937
## TNFSF13B       3.920319e-95  8.255701e-92     1.1151275
## KLF4           3.309726e-94  6.559876e-91     1.2049050
## GRN            4.801206e-91  8.987324e-88     1.3814668
## NAMPT          2.489624e-90  4.415020e-87     1.1438687
## CLEC7A         7.736088e-88  1.303299e-84     1.0616120
## S100A8         3.124930e-84  5.013875e-81     4.8051993
## SERPINA1       1.580359e-82  2.420392e-79     1.3842689
## CD36           8.018347e-79  1.174653e-75     1.0538169
## MPEG1          8.481588e-79  1.190744e-75     0.9778095
## CD68           5.118714e-78  6.898798e-75     0.9481203
## CYBB           1.200516e-77  1.555776e-74     1.0300245
## S100A11        1.174556e-72  1.465759e-69     1.8962486
## RBP7           2.467027e-71  2.968714e-68     0.9666127
## BLVRB          3.762610e-71  4.371634e-68     0.9701168
## CD302          9.859086e-71  1.107307e-67     0.8792077

The high expression of MNDA, CD14 and CD68 suggests that cluster 8 contains monocytes.

marker.set <- markers[["15"]]
as.data.frame(marker.set[1:30,1:3])

##                p.value          FDR summary.logFC
## AIF1      6.912466e-82 2.329086e-77      3.678553
## FCGR3A    3.606603e-64 6.076045e-60      3.041108
## SERPINA1  6.149257e-64 6.906436e-60      2.375549
## FTL       2.857395e-63 2.406927e-59      2.061453
## LST1      4.161324e-62 2.804233e-58      1.892121
## COTL1     1.582271e-61 8.885507e-58      1.305713
## CTSS      3.143428e-55 1.513067e-51      1.880227
## S100A11   1.719244e-54 7.241024e-51      2.030715
## TYMP      4.395645e-50 1.645632e-46      2.101758
## MS4A7     7.348647e-48 2.476053e-44      2.173820
## NAP1L1    2.806622e-47 8.596939e-44      1.142910
## FTH1      5.856763e-47 1.644481e-43      1.188781
## CD68      2.661218e-46 6.897469e-43      2.048530
## STXBP2    5.783395e-45 1.391898e-41      2.075910
## SAT1      9.269854e-43 2.082256e-39      1.541614
## TYROBP    3.459416e-42 7.285097e-39      1.152396
## PSAP      1.582943e-40 3.137392e-37      1.403571
## CSTB      3.871723e-40 7.247436e-37      1.129131
## CFD       4.529809e-39 8.033020e-36      1.999696
## FKBP1A    5.042155e-39 8.494519e-36      1.301575
## CEBPB     6.875529e-39 1.103162e-35      1.683662
## LINC01272 7.357858e-38 1.126889e-34      1.840199
## FCER1G    2.802003e-37 4.104813e-34      1.301967
## SPI1      9.236815e-36 1.296772e-32      1.890047
## S100A4    4.102320e-35 5.528943e-32      4.356299
## NEAT1     5.928869e-33 7.683358e-30      1.358303
## IFITM3    6.875461e-33 8.580066e-30      1.871086
## LYZ       8.944290e-32 1.076317e-28      1.804654
## LRRC25    1.072482e-31 1.246077e-28      1.242115
## RNASET2   7.560048e-31 8.490942e-28      1.090212

The high expression of FCGR3A suggests that cluster 15 contains macrophagues.

# Checking the cluster is what we wanted.
marker.set <- markers[["8"]]
topset <- rownames(marker.set)[1:30]
stopifnot(all(c("CD14", "CD68", "MNDA") %in% topset))

plotExpression(sce.pbmc, features=c("CD14", "CD68",
    "MNDA", "FCGR3A"), x="label", colour_by="label")

Figure 6: Distribution of expression values for monocyte and macrophage markers across clusters in the PBMC dataset.

bioconductor r scRNAseq

Maria Dermit

Postdoctoral Research Scientist

Postdoctoral researcher interested in translation control and data science for biomedical research.