Consensus Taxonomy and Marker Genes

Cell Type Identification in single cell RNASeq data

The SpaceTx consortium chose the mouse visual cortex as the substrate for spatial transcriptomics comparisons. This was Consensus clustering

Computational strategies:

  • (Eeshit DV) SIMLR (Single-cell Interpretation via Multi-kernel LeaRning) learns an appropriate distance metric for clustering and accounts for drop-outs. Can also be run as a consensus clustering strategy. (https://github.com/BatzoglouLabSU/SIMLR).

  • (Peter K) PAGODA (Pathway And Gene set Over-Dispersion Analysis) resolves multiple aspects of gene expression heterogeneity by testing gene sets for coordinated variability among measured cells (http://hms-dbmi.github.io/scde/pagoda.html).

  • (Zizhen Y) scrattch.hicat (Hierarchical, Iterative Clustering for Analysis of Transcriptomics) iteratively repeats a multi-step approach on each cluster using different variable genes until no new clusters can be defined, then bootstraps the iterative process (https://github.com/AllenInstitute/scrattch.hicat).

  • (Trygve B) - Precursor pipeline used for human MTG

  • (Kenneth H) proMMT (Probabilistic Mixture Modeling for Transcriptomics) uses an EM algorithm to iteratively define clusters using the Bayesian information criterion and identify a sparse set of genes varying across clusters (Harris et al 2018; “Classes and continua…”; PLoS Biol.).

Most methods were run multiple times with different parameter settings.

Cluster consensus:

  • (Trygve B) - Cluster comparisons using Adjusted Rand, VI, etc. and showing which clusters subclasses /types are split more by various methods

  • (Brian A) - Consensus mouse and human clusters were defined using clusterExperiment R library, which clusters the co-occurrence matrix (https://github.com/epurdom/clusterExperiment).

Mouse clusters were defined on VISp + ALM data set and 68 ALM-specific clusters were removed from the consensus clustering, for 191 total clusters

Marker Gene Selection

Four different computational strategies:

  • (Jeremy M) - Greedy algorithm generally identifies combinatorial genes. This method was used for completing smaller panels and validating all panels.

  • (Kenneth H) Probability model aimed to maximize the average number of levels of a clustering tree correctly classified

  • (Brian A) - NS-Forest: Random forest strategy aimed at identifying the most informative markers of specific cell types

  • (Eeshit DV) Random forest strategy which ranks genes based on the overall importance towards classification

(Eeshit DV) These methods were merged together into a ranked list and some gene panels were selected from this list.