Single-cell transcriptome and chromatin accessibility data integration reveals cell specific signatures
1 Division of Neuroblastoma Genomics, German Cancer Research Center, Heidelberg, Germany.
2 Health Data Science Unit, Medical Faculty Heidelberg and BioQuant.
* Correspondence: [email protected]
Andrés Quintero1,2,°, Anne-Claire Kröger2 and Carl Herrmann2,*
The ability to integrate multiple layers of omics data plays an essential role in understanding the complex interplay of different molecular mechanisms that give rise to cellular diversity.
To address this challenge, we implemented Integrative Iterative Non-negative Matrix Factorization (i2NMF), a computational method to dissect genomic signatures from multi-omics data sets.
i2NMF was implemented as an extension of the R package Bratwurst, available on GitHub:
https://github.com/wurst-theke/bratwurst
We applied i2NMF to:
1. recover cell-specific signatures between different species
2. identify rare cell populations
i2NMF workflow:
Stage 1: Signatures are shared across data sets.
Stage 2: Variable number of signatures for every data set, for every initial matrix Xn.
The shared effect is recovered in the Hs matrix, and the exposures of the features explaining this effect are contained in the Ws matrices (Yang and Michailidis, 2016).
The explained variance of the decomposed model can be estimated for both stages by:
This is useful for comparing performance between the stages and for the overall decomposition.
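The explained-variance formula itself is not reproduced in the text here, so the following is only a sketch under an assumed definition: the fraction of the squared Frobenius norm of X captured by a reconstruction, EV = 1 - ||X - X_hat||² / ||X||². The function name `explained_variance` and the toy matrices are hypothetical; the stand-in reconstructions play the role of Ws × Hs (stage 1) and Wr × Hr (stage 2).

```python
import numpy as np


def explained_variance(X, X_hat):
    """Fraction of the squared Frobenius norm of X captured by X_hat.

    Assumption (not taken from the poster):
    EV = 1 - ||X - X_hat||_F^2 / ||X||_F^2, without centering.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    rss = np.sum((X - X_hat) ** 2)  # residual sum of squares
    tss = np.sum(X ** 2)            # total sum of squares
    return 1.0 - rss / tss


# Toy input and hypothetical reconstructions from each stage.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
stage1_hat = np.array([[1.0, 2.0], [3.0, 3.0]])  # stand-in for Ws @ Hs
stage2_hat = np.array([[0.0, 0.0], [0.0, 1.0]])  # stand-in for Wr @ Hr

ev_stage1 = explained_variance(X, stage1_hat)
ev_overall = explained_variance(X, stage1_hat + stage2_hat)
ev_stage2 = ev_overall - ev_stage1  # extra share gained in the second stage
```

Reporting the second stage as the gain over the first stage is one possible convention; it keeps the two stage values additive with respect to the overall decomposition.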
Common features should be shared across columns, e.g. gene or sample IDs.
i2NMF advantages
● The feature exposure matrices Ws and Wr are different, recovering unique signatures between stage 1 and 2.
● The number of inferred signatures in stage 2 can vary across matrices, allowing a better resolution of specific effects.
● All solvers were implemented on TensorFlow, allowing scalability between platforms.
1. Starting from two or more non-negative matrices $X_1, X_2, \dots, X_n$, i2NMF initially decomposes the shared effect across them using integrative NMF (iNMF), solving the following problem:

$$X_n \approx W_{s,n} \times H_s$$

2. On a second iteration, i2NMF decomposes the residual effect, which was not explained by the shared decomposition. The residual input matrix $X_{r,n}$ is defined by:

$$X_{r,n} = X_n - W_{s,n} \times H_s$$

and decomposed into:

$$X_{r,n} \approx W_{r,n} \times H_{r,n}$$

Together, the two stages give:

$$X_n \approx W_{s,n} \times H_s + W_{r,n} \times H_{r,n}$$
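The two stages described above can be sketched in NumPy as follows. This is a minimal illustration, not the Bratwurst implementation, which according to the poster uses TensorFlow solvers. Several points are assumptions: stage 1 is simplified to a plain joint NMF with one shared H fitted by multiplicative updates, and leaves out any additional terms of the iNMF objective of Yang and Michailidis (2016); negative residual entries are clipped to zero so that stage 2 stays non-negative; the function names `shared_nmf` and `residual_nmf` and the toy data are hypothetical.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def shared_nmf(Xs, k, n_iter=200, seed=0):
    """Stage 1 sketch: X_n ~ W_n @ H with a single H shared by all matrices."""
    rng = np.random.default_rng(seed)
    n_cols = Xs[0].shape[1]  # all matrices must share their columns
    H = rng.random((k, n_cols)) + EPS
    Ws = [rng.random((X.shape[0], k)) + EPS for X in Xs]
    for _ in range(n_iter):
        # Update each feature exposure matrix W_n with H held fixed.
        Ws = [W * (X @ H.T) / (W @ (H @ H.T) + EPS) for W, X in zip(Ws, Xs)]
        # Update the shared H using all matrices at once.
        num = sum(W.T @ X for W, X in zip(Ws, Xs))
        den = sum(W.T @ W for W in Ws) @ H + EPS
        H = H * num / den
    return Ws, H


def residual_nmf(Xs, Ws, H, ranks, n_iter=200, seed=1):
    """Stage 2 sketch: factorize each residual matrix with its own rank."""
    results = []
    for X, W, k in zip(Xs, Ws, ranks):
        # Assumption: negative residual entries are clipped to zero.
        Xr = np.clip(X - W @ H, 0.0, None)
        (Wr,), Hr = shared_nmf([Xr], k, n_iter=n_iter, seed=seed)
        results.append((Xr, Wr, Hr))
    return results


# Toy data: two matrices with different row counts and 40 shared columns.
rng = np.random.default_rng(42)
H_true = rng.random((3, 40))
X1 = rng.random((30, 3)) @ H_true
X2 = rng.random((50, 3)) @ H_true

Ws, Hs = shared_nmf([X1, X2], k=3)
stage2 = residual_nmf([X1, X2], Ws, Hs, ranks=[2, 4])  # ranks may differ
```

Passing one rank per matrix to `residual_nmf` mirrors the point that the number of inferred signatures in stage 2 can vary across matrices, while the single `Hs` returned by `shared_nmf` carries the shared effect.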
[Figure: Shared signatures Sig. 1 to Sig. 7 against cell type enrichment (enrichment Z-score) for Oligodendrocyte, Neuron, Endothelial, Astrocyte, Mural, OPCs, Microglia, Ependymal and Polydendrocyte, and against gene set enrichment (-log10(p-value)) for the gene sets Lein astrocyte markers, Lein oligodendrocyte markers, Lein neuron markers, GO endothelium development, GO regulation of immune response and GO regulation of neuron differentiation.]
[Figure: Explained variance per i2NMF stage (1st, 2nd) for Human and Mouse.]
[Figure: UMAP (UMAP1 vs. UMAP2) of the cells, colored by cell type and by species (Human, Mouse).]
[Figure: UMAP of mouse polydendrocyte sub-types (Tnr Bmp4, Tnr Cspg5, Tnr Cspg5 Dad1, Tnr Opalin, Tnr Tmem2) and mouse residual signatures, with exposure Hr mouse (low to high), for the marker genes CSPG5, OPALIN, BMP4 and TNR.]
Human & Mouse substantia nigra (SN) scRNA-seq data
40,453 cells (Welch et al., 2019); 51,912 cells (Saunders et al., 2018); 5,127 genes in common
a. The human and mouse SN data sets were integrated over the set of shared genes using i2NMF.
b. The shared signatures identified in the first integrative step were able to combine human and mouse cells (left) and resolve groups of the most relevant cell types in the SN.
c. Cell type and gene set enrichment analysis revealed that each shared signature corresponds to specific cell types.
d. The second stage of i2NMF recovered species-specific signatures that helped to resolve cellular sub-types (top) and were defined by marker genes (bottom).
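Integrating over the set of shared genes requires first restricting both matrices to the columns they have in common, in the same order. A minimal sketch, assuming each data set comes as a NumPy matrix with one column per gene plus an array of column IDs; the helper name `align_on_shared_ids`, the toy matrices and the gene lists are hypothetical, and any ortholog mapping between species is assumed to have been done beforehand.

```python
import numpy as np


def align_on_shared_ids(X_a, ids_a, X_b, ids_b):
    """Subset two matrices to the columns whose IDs occur in both."""
    # intersect1d returns the sorted shared IDs and their positions
    # in each input array.
    shared, idx_a, idx_b = np.intersect1d(ids_a, ids_b, return_indices=True)
    return X_a[:, idx_a], X_b[:, idx_b], shared


# Toy example with hypothetical gene lists (columns) and cells (rows).
human_genes = np.array(["TNR", "OPALIN", "GFAP", "BMP4"])
mouse_genes = np.array(["BMP4", "CSPG5", "TNR"])
X_human = np.arange(8, dtype=float).reshape(2, 4)
X_mouse = np.arange(9, dtype=float).reshape(3, 3)

Xh, Xm, shared = align_on_shared_ids(X_human, human_genes, X_mouse, mouse_genes)
```

After this step both matrices have identical column order, which is the precondition for a factorization with a shared H matrix.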
[Figure: Exposure of the shared signatures Shared Sig. 1 and Shared Sig. 2 (exposure H shared, low to high) and of ATAC-seq Sig. 3 (exposure H ATAC-seq, low to high) per cell, annotated by cell type (Morula, Blastocyst).]
[Figure: Explained variance per i2NMF stage (1st, 2nd) for ATAC-seq and RNA-seq.]
[Figure: Regulatory relationships (present / not present) per cell, annotated by cell type (Morula, Blastocyst, ICM), and relative expression (0% to 100%) of KLF17, NANOG and OCT4 in Morula, Blastocyst and ICM.]
Human embryos: morula and blastocyst scCAT-seq data
a. The human embryo scCAT-seq data set was integrated over all 72 cells. For the gene expression data, the majority of the explained variance was captured in the first stage of i2NMF. Interestingly, for the chromatin accessibility data the second stage also recovered a considerable fraction of the variance.
b. The shared H matrix recovered two cell-specific signatures. In the second iteration on the ATAC-seq data, a distinct signature was recovered for two cells.
c. The decomposed shared signatures were stable across a range of factorization ranks, showing a clear separation between morula and blastocyst cells.
d. The set of accessible chromatin regions associated with ATAC-seq Sig. 3 and its target genes showed a specific pattern for two blastocyst cells. These cells also showed higher expression of marker genes for cells of the inner cell mass (ICM), allowing the identification of this rare cell type.
[Figure: Shared signatures across factorization ranks K2 to K6 for ATAC-seq and RNA-seq, annotated by cell type (Morula, Blastocyst).]
72 cells, with gene expression and chromatin accessibility for every cell (Liu et al., 2019)
16,501 expressed genes (RNA-seq) & 42,713 identified peaks (ATAC-seq)