scRNAseq to Reference

In this vignette, we aim to describe default process of FlexiDeconv to convert scRNAseq data to reference. The process uses three input datasets: 1. Spatial data with dimension pixel x gene 2. scRNAseq data with dimension cell x gene 3. Cell type annotation for every cell within scRNAseq, as a vector

For this vignette, we use the spatial data obtained from [@Moffitt2018DryadHypothalamus] and scRNAseq data from [@NeMO_dat]. Note that the process described here is slightly different to the actual imperfect reference section in the paper, for the following purposes: 1. No redundant gene filtering was done in the actual paper, because the geneset size is already quite limited, there is no point to filter genes for speed purposes. 2. No variable gene selection was done in the actual paper, same reason as above. 3. The scRNAseq dataset only contains a small subset because the original dataset was too large.

Load in the libraries and data

library(FlexiDeconv)
library(STdeconvolve)
library(ggplot2)
library(ggdist)
data(scRNAseq_to_reference)

Spatial data sample (5 pixels x 5 genes) Note that the rownames are simulated pixel barcodes.

scRNAseq_to_reference$spatial[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##   Ace2 Adora2a Aldh1l1 Amigo2 Ano3
## 1    0       1       4      0    0
## 2    0       0       2      1    0
## 3    0       0       1      2    0
## 4    0       .       3      2    0
## 5    0       0       1      0    0

scRNAseq data sample (5 cells x 5 genes)

scRNAseq_to_reference$scRNAseqCount[1:5,1:5]
##                                                Xkr4 Gm1992 Gm19938 Gm37381 Rp1
## GACCGTGAGTAGTCAA-L8TX_201120_01_G09-1178468020    0      0       0       0   0
## AACGGGAGTGGCTGCT-L8TX_200924_01_G05-1177903030    0      0       0       0   0
## TCATTTGAGCAGGGAG-L8TX_201015_01_H02-1177903092   25      0       5       0   0
## GGCTTTCGTACGCTTA-L8TX_220331_01_D05-1178166166   37      1       1       0   0
## GAGGCCTAGATGTTGA-L8TX_201029_01_G10-1177903514   10      0       4       0   0

scRNAseq cell type data (5 cells)

scRNAseq_to_reference$scRNAseqCellType[1:5]
## GACCGTGAGTAGTCAA-L8TX_201120_01_G09-1178468020 AACGGGAGTGGCTGCT-L8TX_200924_01_G05-1177903030 TCATTTGAGCAGGGAG-L8TX_201015_01_H02-1177903092 GGCTTTCGTACGCTTA-L8TX_220331_01_D05-1178166166 
##                                  "OB-IMN GABA"                                  "OB-IMN GABA"                                 "CTX-MGE GABA"                                      "HY Glut" 
## GAGGCCTAGATGTTGA-L8TX_201029_01_G10-1177903514 
##                                    "OPC-Oligo"

Step 1: human protein coding genes filter

Typically, for performing deconvolution on Spatial Transcriptomics data obtained from human tissues, we would filter out non-human protein coding genes. However, since the current spatial data is from mouse, we do not perform such procedure.

For dataset size issues, we decide to leave this part blank and invite users to filter for human protein coding genes themselves.

Step 2: find intersection gene set

Here we find interesection gene set between spatial data and scRNAseq data.

common_geneset = commonGeneFilter(scRNAseq_to_reference$spatial,
                                  scRNAseq_to_reference$scRNAseqCount)

Step 3: remove redundant genes

Here we remove redundant genes, defined as genes that have 0 counts across all pixels or across all cells.

remove_redundant = redundantGeneFilter(common_geneset$spatial, common_geneset$scRNAseq)

Step 4: retain variable genes

Here we select variable genes using STdeconvolve’s default gene selection method. The purpose of step 4 is to reduce geneset size. Note that we only use spatial data for variable gene selection here.

Up to this point, spatial data is finalized.

variableGenes = t(STdeconvolve::restrictCorpus(t(as.matrix(remove_redundant$spatial)), 
                                        removeAbove = 1.0,
                                        removeBelow = 0.01,
                                        alpha = 0.05,
                                        plot = F,
                                        verbose = T))
## Removing 0 genes present in 100% or more of pixels...
## 128 genes remaining...
## Removing 1 genes present in 1% or less of pixels...
## 127 genes remaining...
## Restricting to overdispersed genes with alpha = 0.05...
## Converting to sparse matrix ...
## Calculating variance fit ...
## Using gam with k=5...
## 36 overdispersed genes ...
##  Using top 1000 overdispersed genes.
##  number of top overdispersed genes available: 36
variableGenesData = list(spatial = variableGenes,
                         scRNAseq = remove_redundant$scRNAseq[,colnames(variableGenes)])

Step 5: remove outlier cells

Finally, we remove outlier cells in scRNAseq data, with outlier cells defined as follows: 1. Cells with total gene count 1.5*IQR above Q3 of all cells within current cell type. 2. Gene ranked lowerThresholdRank (default 2) having less than lowerThresholdCount counts (default 3). For example, under the current default setting, if the second highest gene count within a cell is below 3, then we remove this cell. It is motivated by keep cells having sufficient count in the marker genes.

We also plot (rain cloud plot) the distribution of total read depth (total gene count) for every cell type.

removeOutlierCells = outlierCellRemoval(variableGenesData$scRNAseq, 
                                        scRNAseq_to_reference$scRNAseqCellType,
                                        lowerThresholdRank = 2,
                                        lowerThresholdCount = 3)

Step 6: form reference

Finally, we make reference by: 1. For every cell type, find all cells belonging to current cell type 2. Sum the raw gene counts across all cells 3. Normalize the overall gene count for this cell type such that the sum is 1.

reference = formReference(removeOutlierCells$scRNAseq, removeOutlierCells$celltypes)
visualizeReference(reference)

User-input reference

We also allow user to input their own cell type x gene reference, with some caveats addressed here.

data(reference_ready)
reference_ready[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                     Ano3     Arhgap36         Bmp7        Calcr        Cbln1
## CNU-HYa GABA 0.059945504 0.0602482592 0.0003027551 0.0006055101 0.0036330609
## CNU-HYa Glut 0.014664266 0.0334448161 0.0007718034 0.0095189092 0.0537689735
## CNU-LGE GABA 0.281306463 0.0033356498 0.0002779708 0.0033356498 0.0002779708
## CNU-MGE GABA 0.005159959 0.0008845644 .            .            0.0005897096
## CTX-CGE GABA 0.002653438 0.0004614675 0.0003461006 0.0001153669 .

The example reference used here is exactly the same as above. There are zeros in the reference, which is not allowed by FlexiDeconv. The default workaround is to replace zero by 0.1 x smallest non-zero value in the reference, then we rescale the reference to make row sums be 1.

ref_replace_zero <- replaceZero(reference_ready, zeroFillFactor = 0.1)
ref_replace_zero[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                     Ano3     Arhgap36         Bmp7        Calcr        Cbln1
## CNU-HYa GABA 0.059944497 0.0602472469 3.027500e-04 6.055000e-04 3.633000e-03
## CNU-HYa Glut 0.014663650 0.0334434112 7.717710e-04 9.518509e-03 5.376671e-02
## CNU-LGE GABA 0.281289921 0.0033354536 2.779545e-04 3.335454e-03 2.779545e-04
## CNU-MGE GABA 0.005159699 0.0008845198 8.400820e-06 8.400820e-06 5.896798e-04
## CTX-CGE GABA 0.002653282 0.0004614403 3.460802e-04 1.153601e-04 8.400749e-06
rowSums(as.matrix(ref_replace_zero))
##   CNU-HYa GABA   CNU-HYa Glut   CNU-LGE GABA   CNU-MGE GABA   CTX-CGE GABA   CTX-MGE GABA    DG-IMN Glut        HY GABA        HY Glut     HY MM Glut         Immune     IT-ET Glut        MB Dopa        MB GABA        MB Glut 
##              1              1              1              1              1              1              1              1              1              1              1              1              1              1              1 
##     MB-HB Sero        MY Glut NP-CT-L6b Glut     OB-CR Glut    OB-IMN GABA      OPC-Oligo         P GABA         P Glut       Vascular 
##              1              1              1              1              1              1              1              1              1