1 Loading Libraries

suppressPackageStartupMessages(library(escape))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(dittoSeq))
suppressPackageStartupMessages(library(ggplot2))

1.1 Loading color palette

This is the color scheme used in the escape package, but for visualizations that come from other packages, we can use the new colorblind_vector() function to match aesthetics.

colorblind_vector <- colorRampPalette(rev(c("#0D0887FF", "#47039FFF", 
              "#7301A8FF", "#9C179EFF", "#BD3786FF", "#D8576BFF",
              "#ED7953FF","#FA9E3BFF", "#FDC926FF", "#F0F921FF")))

2 Loading Processed Single-Cell Data

For the purposes of this vignette a pre-processed Seurat object is available from the indicated URL. This data is a subset of a larger, published cutaneous T cell lymphoma project (CTCL) with 1,141 Nonmalignant or normal T cells (N) and 847 malignant T cells (T). We have already integrated and clustered the data into 8 distinct clusters. We will get more into the differences between the cell types as the vignette commences.

seurat_ex <- readRDS(url("https://ncborcherding.github.io/vignettes/escape_seurat_ex.rds"))
seurat_ex <- UpdateSeuratObject(seurat_ex)
DimPlot(seurat_ex, label = T) + NoLegend()

DimPlot(seurat_ex, group.by = "Type") + scale_color_manual(values = colorblind_vector(2))

3 Getting Gene Sets

3.1 Option 1: Molecular Signture Database

The first step in the process of performing gene set enrichment analysis is identifying the gene sets we would like to use. The function getGeneSets() allows users to isolate a whole or multiple libraries from a list of GSEABase GeneSetCollection objects. We can do this for gene set collections from the built-in Molecular Signature Database by setting the parameter library equal to library/libraries of interest. For multiple libraries, just set library = c(“Library1”, “Library2”, etc).

In Addition:
- Individual pathways/gene sets can be isolated from the libraries selected, by setting gene.sets = the name(s) of the gene sets of interest.
- Subcategories of the invidual libaries can be selected using the subcategory parameter.

If the sequencing of the single-cell data is performed on a species other than “Homo sapiens”, make sure to use the species parameter in getGeneSets() in order to get the correct gene nomenclature.

gene.sets <- getGeneSets(library = "H")

3.2 Option 2: Built-In gene sets

data("escape.gene.sets", package="escape")
gene.sets <- escape.gene.sets

3.3 Option 3: Define personal gene sets

gene.sets <- list(Tcell_signature = c("CD2","CD3E","CD3D"),
            Myeloid_signature = c("SPI1","FCER1G","CSF1R"))

4 Enrichment

The next step is performing the enrichment on the RNA count data. The function enrichIt() can handle either a matrix of raw count data or will pull that data directly from a SingleCellExperiment or Seurat object. The gene.sets parameter in the function is the GeneSetCollection, either generated from getGeneSets() or from the user. The enrichment scores will be calculated across all individual cells and groups is the n size to break the enrichment by while the cores is the number of cores to perform in parallel during the enrichment calculation too.

enrichIt() can utilize two distinct methods for quantification using the method parameter - either the “ssGSEA” method described by Barbie et al 2009 or “UCell” described by Andreatta and Carmona 2021.

To prevent issues with enrichment calculations for gene sets with a low number of genes represented in the count data of a single-cell object - min.size was instituted to remove gene sets with less than the indicated genes. For the purposes of the demo, however, I am setting the value to NULL.

Important Note: This is computationally intensive and is highly dependent on the number of cells and the number of gene sets included.

ES <- enrichIt(obj = seurat_ex, 
               gene.sets = gene.sets, 
               groups = 1000, cores = 4, 
               min.size = NULL)

We can then easily add these results back to our Seurat object using the AddMetaData() function in the Seurat package.

seurat_ex <- AddMetaData(seurat_ex, ES)

5 Visualizations

The easiest way to generate almost any visualization for single cell data is via r Biocpkg(“dittoSeq”), which is an extremely flexible visualization package for both bulk and single-cell RNA-seq data that works very well for both expression data and metadata.

dittoSeq uses the meta data for grouping parameters, to include our labeled cluster IDs, we will just make a new meta data variable active.ident

seurat_ex@meta.data$active.idents <- seurat_ex@active.ident

5.1 1. The Heatmap

A simple way to approach visualizations for enrichment results is the heatmap, especially if you are using a number of gene sets or libraries.

dittoHeatmap(seurat_ex, genes = NULL, metas = names(ES),
             heatmap.colors = rev(colorblind_vector(50)),
             annot.by = c("active.idents", "Type"),
             cluster_cols = TRUE,
             fontsize = 7)

A user can also produce a heatmap with select gene sets by providing specific names to the metas parameter. For example, we can isolated gene sets involved in apoptosis and DNA damage.

dittoHeatmap(seurat_ex, genes = NULL, 
             metas = c("HALLMARK_APOPTOSIS", "HALLMARK_DNA_REPAIR", "HALLMARK_P53_PATHWAY"), 
              heatmap.colors = rev(colorblind_vector(50)),
             annot.by = c("active.idents", "Type"),
             cluster_cols = TRUE,
             fontsize = 7)

5.2 2. The Violin Plot

Another way to visualize a subset of gene set enrichment would be to graph the distribution of enrichment using violin, jitter, boxplot, or ridgeplots. We can also compare between categorical variables using the group.by parameter.

dittoPlot(seurat_ex, "HALLMARK_DNA_REPAIR", group.by = "Type") + 
    scale_fill_manual(values = colorblind_vector(2))

5.3 3. Hex Density Enrichment Plots

We can also compare the distribution of enrichment scores of 2 distinct gene sets across all single cells using the dittoScatterHex() function. Here, we use our SingleCellExperiment object with results of enrichIt() and specify gene sets to the x.var and y.var parameters to produce a density plot. We can also add contours to the plot, by passing do.contour = TRUE. I added the addition of geom_hline and geom_vline to create an origin, this is particularly handy when starting to facet the dittoScatterHex() plots.

dittoScatterHex(seurat_ex, 
    x.var = "HALLMARK_DNA_REPAIR", y.var = "HALLMARK_MTORC1_SIGNALING", do.contour = TRUE) +             
    theme_classic() +
    scale_fill_gradientn(colors = rev(colorblind_vector(11))) +
    geom_vline(xintercept = 0, lty=2) + 
    geom_hline(yintercept = 0, lty=2)  

We can also separate the graph using the split.by parameter, allowing for the direct comparison of categorical variables. For example, we can look at the difference in the distribution of the selected gene.sets by cell type.

dittoScatterHex(seurat_ex,
    x.var = "HALLMARK_DNA_REPAIR", y.var = "HALLMARK_MTORC1_SIGNALING", do.contour = TRUE, split.by = "Type") + 
    theme_classic() + 
    scale_fill_gradientn(colors = rev(colorblind_vector(11))) +
    geom_vline(xintercept = 0, lty=2) + 
    geom_hline(yintercept = 0, lty=2) 

##4. Enrichment along a Ridge Plot

Another distribution visualization is using the Ridge Plot from the ggridges R package. This allows the user to incorporate categorical variables to separate the enrichment scores along the y-axis in addition to the faceting by categorical variables.

Like above, we can explore the distribution of the “HALLMARK_DNA_REPAIR” between Types by calling the ridgeEnrichment() with the ES2 object. We specify the groups = Type, which will separate tumor and normal cells on the y-axis. We can also add a rug parameter (add.rug), to look at the discrete sample placement along the enrichment ridge plot.

ES2 <- data.frame(seurat_ex[[]], Idents(seurat_ex))
colnames(ES2)[ncol(ES2)] <- "cluster"
ridgeEnrichment(ES2, gene.set = "HALLMARK_DNA_REPAIR", group = "Type", add.rug = TRUE)

In addition to the separation of Type, we can also use the ridgeEnrichment() for better granularity of multiple variables. For example, instead of looking at the difference just between “Type”, we can set the group = “cluster” and then facet by “Type”. This gives a visualization of the enrichment of DNA Repair by cluster and Type.

ridgeEnrichment(ES2, gene.set = "HALLMARK_DNA_REPAIR", group = "cluster", facet = "Type", add.rug = TRUE)

5.4 5. The Split Violin Plot

Another distribution visualization is a violin plot, which we separate and directly compare using a binary classification. Like RidgeEnrichment(), this allows for greater use of categorical variables. For SplitEnrichment(), the output will be two halves of a violin plot based on the split parameter with a central boxplot with the relative distribution across all samples.

Like above, we can explore the distribution of the “HALLMARK_DNA_REPAIR” between Types by calling the SplitEnrichment() with the ES2 object. We specify the split = “Type”“, which will separate tumor and normal cells in the violin plot itself. Like above, we can also explore the cluster distribution by assigning the x-axis =”cluster".

splitEnrichment(ES2, split = "Type", gene.set = "HALLMARK_DNA_REPAIR")

splitEnrichment(ES2, x.axis = "cluster", split = "Type", gene.set = "HALLMARK_DNA_REPAIR")

5.5 6. Enrichment Plots

New to the dev version of the escape is enrichmentPlot() that takes the single-cell expression object (pbmc_small) and calculates mean rank order for a gene set across groups. The function requires the name of the specific gene set (gene.set) and the library of gene sets (gene.sets) to extract the rank positions from the count data stored in the Seurat or single-cell expression object.

Note: As of right now, the rank order is based on the ssGSEA calculation for the visualization.

enrichmentPlot(seurat_ex, 
               gene.set = "HALLMARK_DNA_REPAIR",
               gene.sets = gene.sets,
               group = "Type")  + 
  scale_color_manual(values = colorblind_vector(5)[c(1,4)])


6 Expanded Analysis

One frustration of Gene Set Enrichment is trying to make sense of the values. In order to move away from just selecting pathways that may be of interest, escape offers the ability to performPCA() on the enrichment scores. Like the other functions, we will need the output of enrichIt() and the groups to include for later graphing.

PCA <- performPCA(enriched = ES2, gene.sets = names(gene.sets), groups = c("Type", "cluster"))
pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = TRUE)

We can also look at the pcaEnrichment() output separated by categorical factors using the facet parameter, for example using the cluster assignment.

pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = FALSE, facet = "cluster") 

We can more closely examine the construction of the PCA by looking at the contribution of each gene set to the respective principal component using masterPCAPlot() with the same input as above with the pcaEnrichment(). We can also control the gene sets graphed by selecting more or less than 10 for top.contribution.

masterPCAPlot(ES2, gene.sets = names(gene.sets), PCx = "PC1", PCy = "PC2", top.contribution = 10)


7 Signficance Testing

We can also look for significant differences between groups of variables using getSignificance(). For this, we need to assign a group parameter and the type of fit including: T.test, linear regression (LR), Wilcoxon Rank Sum Test (Wilcoxon), ANOVA). getSignificance() will pull any numeric values, to ensure only gene sets, use the gene.sets parameter, similar to the PCAplot functions above.

Returned is a test statistic, raw p value, FDR value using Bonferroni correction, and the median values for each group. In addition, ANOVA will automatically return the raw p-values for each comparison in the group.

output <- getSignificance(ES2, group = "Type", fit = "T.test") 

8 Conclusions

Hopefully this is helpful for analysis of single-cell datasets. Although not extensive, in terms of a fully-functional toolkit, escape functions focus on issues that PIs frequently request, making sense of data. If you have any questions, comments or suggestions, feel free to visit the github repository or email me.

##Session Info

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dittoSeq_1.4.1     ggplot2_3.3.5      SeuratObject_4.0.4 Seurat_4.1.0      
## [5] escape_1.4.1       BiocStyle_2.20.2  
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  reticulate_1.24            
##   [3] tidyselect_1.1.2            RSQLite_2.2.10             
##   [5] AnnotationDbi_1.54.1        htmlwidgets_1.5.4          
##   [7] grid_4.1.0                  BiocParallel_1.26.1        
##   [9] Rtsne_0.15                  munsell_0.5.0              
##  [11] ScaledMatrix_1.0.0          codetools_0.2-18           
##  [13] ica_1.0-2                   future_1.24.0              
##  [15] miniUI_0.1.1.1              withr_2.4.3                
##  [17] spatstat.random_2.1-0       colorspace_2.0-3           
##  [19] Biobase_2.52.0              highr_0.9                  
##  [21] knitr_1.37                  rstudioapi_0.13            
##  [23] stats4_4.1.0                SingleCellExperiment_1.14.1
##  [25] ROCR_1.0-11                 tensor_1.5                 
##  [27] listenv_0.8.0               MatrixGenerics_1.4.2       
##  [29] labeling_0.4.2              GenomeInfoDbData_1.2.6     
##  [31] polyclip_1.10-0             bit64_4.0.5                
##  [33] farver_2.1.0                pheatmap_1.0.12            
##  [35] rhdf5_2.36.0                parallelly_1.30.0          
##  [37] vctrs_0.3.8                 generics_0.1.2             
##  [39] xfun_0.29                   R6_2.5.1                   
##  [41] GenomeInfoDb_1.28.1         rsvd_1.0.5                 
##  [43] isoband_0.2.5               msigdbr_7.4.1              
##  [45] bitops_1.0-7                rhdf5filters_1.4.0         
##  [47] spatstat.utils_2.3-0        cachem_1.0.6               
##  [49] DelayedArray_0.18.0         assertthat_0.2.1           
##  [51] promises_1.2.0.1            scales_1.1.1               
##  [53] gtable_0.3.0                beachmat_2.8.0             
##  [55] globals_0.14.0              goftest_1.2-3              
##  [57] rlang_1.0.1                 splines_4.1.0              
##  [59] lazyeval_0.2.2              hexbin_1.28.2              
##  [61] spatstat.geom_2.3-2         broom_0.7.12               
##  [63] BiocManager_1.30.16         yaml_2.3.5                 
##  [65] reshape2_1.4.4              abind_1.4-5                
##  [67] backports_1.4.1             httpuv_1.6.5               
##  [69] tools_4.1.0                 bookdown_0.24              
##  [71] ellipsis_0.3.2              spatstat.core_2.4-0        
##  [73] jquerylib_0.1.4             RColorBrewer_1.1-2         
##  [75] BiocGenerics_0.38.0         ggridges_0.5.3             
##  [77] Rcpp_1.0.8                  plyr_1.8.6                 
##  [79] sparseMatrixStats_1.4.2     zlibbioc_1.38.0            
##  [81] purrr_0.3.4                 RCurl_1.98-1.6             
##  [83] rpart_4.1.16                deldir_1.0-6               
##  [85] pbapply_1.5-0               cowplot_1.1.1              
##  [87] S4Vectors_0.30.0            zoo_1.8-9                  
##  [89] SummarizedExperiment_1.22.0 ggrepel_0.9.1              
##  [91] cluster_2.1.2               magrittr_2.0.2             
##  [93] data.table_1.14.2           magick_2.7.3               
##  [95] scattermore_0.8             lmtest_0.9-39              
##  [97] RANN_2.6.1                  fitdistrplus_1.1-6         
##  [99] matrixStats_0.61.0          patchwork_1.1.1            
## [101] mime_0.12                   evaluate_0.15              
## [103] GSVA_1.40.1                 xtable_1.8-4               
## [105] XML_3.99-0.9                IRanges_2.26.0             
## [107] gridExtra_2.3               compiler_4.1.0             
## [109] UCell_1.1.0                 tibble_3.1.6               
## [111] KernSmooth_2.23-20          crayon_1.5.0               
## [113] htmltools_0.5.2             mgcv_1.8-39                
## [115] later_1.3.0                 snow_0.4-4                 
## [117] tidyr_1.2.0                 DBI_1.1.2                  
## [119] MASS_7.3-55                 babelgene_21.4             
## [121] Matrix_1.4-0                cli_3.2.0                  
## [123] parallel_4.1.0              igraph_1.2.11              
## [125] GenomicRanges_1.44.0        pkgconfig_2.0.3            
## [127] plotly_4.10.0               spatstat.sparse_2.1-0      
## [129] annotate_1.70.0             bslib_0.3.1                
## [131] XVector_0.32.0              stringr_1.4.0              
## [133] digest_0.6.29               sctransform_0.3.3          
## [135] RcppAnnoy_0.0.19            graph_1.70.0               
## [137] spatstat.data_2.1-2         Biostrings_2.60.2          
## [139] rmarkdown_2.12              leiden_0.3.9               
## [141] uwot_0.1.11                 DelayedMatrixStats_1.14.2  
## [143] GSEABase_1.54.0             shiny_1.7.1                
## [145] lifecycle_1.0.1             nlme_3.1-155               
## [147] jsonlite_1.8.0              Rhdf5lib_1.14.2            
## [149] viridisLite_0.4.0           fansi_1.0.2                
## [151] pillar_1.7.0                lattice_0.20-45            
## [153] KEGGREST_1.32.0             fastmap_1.1.0              
## [155] httr_1.4.2                  survival_3.2-13            
## [157] glue_1.6.2                  png_0.1-7                  
## [159] bit_4.0.4                   stringi_1.7.6              
## [161] sass_0.4.0                  HDF5Array_1.20.0           
## [163] blob_1.2.2                  BiocSingular_1.8.1         
## [165] memoise_2.0.1               dplyr_1.0.8                
## [167] irlba_2.3.5                 future.apply_1.8.1