In this vignette, we aim to describe default process of FlexiDeconv to convert scRNAseq data to reference. The process uses three input datasets: 1. Spatial data with dimension pixel x gene 2. scRNAseq data with dimension cell x gene 3. Cell type annotation for every cell within scRNAseq, as a vector
For this vignette, we use the spatial data obtained from [@Moffitt2018DryadHypothalamus] and scRNAseq data from [@NeMO_dat]. Note that the process described here is slightly different to the actual imperfect reference section in the paper, for the following purposes: 1. No redundant gene filtering was done in the actual paper, because the geneset size is already quite limited, there is no point to filter genes for speed purposes. 2. No variable gene selection was done in the actual paper, same reason as above. 3. The scRNAseq dataset only contains a small subset because the original dataset was too large.
Load in the libraries and data
library(FlexiDeconv)
library(STdeconvolve)
library(ggplot2)
library(ggdist)
data(scRNAseq_to_reference)
Spatial data sample (5 pixels x 5 genes) Note that the rownames are simulated pixel barcodes.
scRNAseq_to_reference$spatial[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Ace2 Adora2a Aldh1l1 Amigo2 Ano3
## 1 0 1 4 0 0
## 2 0 0 2 1 0
## 3 0 0 1 2 0
## 4 0 . 3 2 0
## 5 0 0 1 0 0
scRNAseq data sample (5 cells x 5 genes)
scRNAseq_to_reference$scRNAseqCount[1:5,1:5]
## Xkr4 Gm1992 Gm19938 Gm37381 Rp1
## GACCGTGAGTAGTCAA-L8TX_201120_01_G09-1178468020 0 0 0 0 0
## AACGGGAGTGGCTGCT-L8TX_200924_01_G05-1177903030 0 0 0 0 0
## TCATTTGAGCAGGGAG-L8TX_201015_01_H02-1177903092 25 0 5 0 0
## GGCTTTCGTACGCTTA-L8TX_220331_01_D05-1178166166 37 1 1 0 0
## GAGGCCTAGATGTTGA-L8TX_201029_01_G10-1177903514 10 0 4 0 0
scRNAseq cell type data (5 cells)
scRNAseq_to_reference$scRNAseqCellType[1:5]
## GACCGTGAGTAGTCAA-L8TX_201120_01_G09-1178468020 AACGGGAGTGGCTGCT-L8TX_200924_01_G05-1177903030 TCATTTGAGCAGGGAG-L8TX_201015_01_H02-1177903092 GGCTTTCGTACGCTTA-L8TX_220331_01_D05-1178166166
## "OB-IMN GABA" "OB-IMN GABA" "CTX-MGE GABA" "HY Glut"
## GAGGCCTAGATGTTGA-L8TX_201029_01_G10-1177903514
## "OPC-Oligo"
Typically, for performing deconvolution on Spatial Transcriptomics data obtained from human tissues, we would filter out non-human protein coding genes. However, since the current spatial data is from mouse, we do not perform such procedure.
For dataset size issues, we decide to leave this part blank and invite users to filter for human protein coding genes themselves.
Here we find interesection gene set between spatial data and scRNAseq data.
common_geneset = commonGeneFilter(scRNAseq_to_reference$spatial,
scRNAseq_to_reference$scRNAseqCount)
Here we remove redundant genes, defined as genes that have 0 counts across all pixels or across all cells.
remove_redundant = redundantGeneFilter(common_geneset$spatial, common_geneset$scRNAseq)
Here we select variable genes using STdeconvolve’s default gene selection method. The purpose of step 4 is to reduce geneset size. Note that we only use spatial data for variable gene selection here.
Up to this point, spatial data is finalized.
variableGenes = t(STdeconvolve::restrictCorpus(t(as.matrix(remove_redundant$spatial)),
removeAbove = 1.0,
removeBelow = 0.01,
alpha = 0.05,
plot = F,
verbose = T))
## Removing 0 genes present in 100% or more of pixels...
## 128 genes remaining...
## Removing 1 genes present in 1% or less of pixels...
## 127 genes remaining...
## Restricting to overdispersed genes with alpha = 0.05...
## Converting to sparse matrix ...
## Calculating variance fit ...
## Using gam with k=5...
## 36 overdispersed genes ...
## Using top 1000 overdispersed genes.
## number of top overdispersed genes available: 36
variableGenesData = list(spatial = variableGenes,
scRNAseq = remove_redundant$scRNAseq[,colnames(variableGenes)])
Finally, we remove outlier cells in scRNAseq data, with outlier cells defined as follows: 1. Cells with total gene count 1.5*IQR above Q3 of all cells within current cell type. 2. Gene ranked lowerThresholdRank (default 2) having less than lowerThresholdCount counts (default 3). For example, under the current default setting, if the second highest gene count within a cell is below 3, then we remove this cell. It is motivated by keep cells having sufficient count in the marker genes.
We also plot (rain cloud plot) the distribution of total read depth (total gene count) for every cell type.
removeOutlierCells = outlierCellRemoval(variableGenesData$scRNAseq,
scRNAseq_to_reference$scRNAseqCellType,
lowerThresholdRank = 2,
lowerThresholdCount = 3)
Finally, we make reference by: 1. For every cell type, find all cells belonging to current cell type 2. Sum the raw gene counts across all cells 3. Normalize the overall gene count for this cell type such that the sum is 1.
reference = formReference(removeOutlierCells$scRNAseq, removeOutlierCells$celltypes)
visualizeReference(reference)
We also allow user to input their own cell type x gene reference, with some caveats addressed here.
data(reference_ready)
reference_ready[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Ano3 Arhgap36 Bmp7 Calcr Cbln1
## CNU-HYa GABA 0.059945504 0.0602482592 0.0003027551 0.0006055101 0.0036330609
## CNU-HYa Glut 0.014664266 0.0334448161 0.0007718034 0.0095189092 0.0537689735
## CNU-LGE GABA 0.281306463 0.0033356498 0.0002779708 0.0033356498 0.0002779708
## CNU-MGE GABA 0.005159959 0.0008845644 . . 0.0005897096
## CTX-CGE GABA 0.002653438 0.0004614675 0.0003461006 0.0001153669 .
The example reference used here is exactly the same as above. There are zeros in the reference, which is not allowed by FlexiDeconv. The default workaround is to replace zero by 0.1 x smallest non-zero value in the reference, then we rescale the reference to make row sums be 1.
ref_replace_zero <- replaceZero(reference_ready, zeroFillFactor = 0.1)
ref_replace_zero[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Ano3 Arhgap36 Bmp7 Calcr Cbln1
## CNU-HYa GABA 0.059944497 0.0602472469 3.027500e-04 6.055000e-04 3.633000e-03
## CNU-HYa Glut 0.014663650 0.0334434112 7.717710e-04 9.518509e-03 5.376671e-02
## CNU-LGE GABA 0.281289921 0.0033354536 2.779545e-04 3.335454e-03 2.779545e-04
## CNU-MGE GABA 0.005159699 0.0008845198 8.400820e-06 8.400820e-06 5.896798e-04
## CTX-CGE GABA 0.002653282 0.0004614403 3.460802e-04 1.153601e-04 8.400749e-06
rowSums(as.matrix(ref_replace_zero))
## CNU-HYa GABA CNU-HYa Glut CNU-LGE GABA CNU-MGE GABA CTX-CGE GABA CTX-MGE GABA DG-IMN Glut HY GABA HY Glut HY MM Glut Immune IT-ET Glut MB Dopa MB GABA MB Glut
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## MB-HB Sero MY Glut NP-CT-L6b Glut OB-CR Glut OB-IMN GABA OPC-Oligo P GABA P Glut Vascular
## 1 1 1 1 1 1 1 1 1