Tensor decomposition-based unsupervised feature extraction applied to matrix products for multiview data processing
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan.
PLoS ONE 12(8): e0183933. DOI: 10.1371/journal.pone.0183933
What's typical in Bioinformatics?
Small samples (a few), huge number of variables (= genes, ~$10^4$) → a typical “large p small n” problem
Difficult to apply usual statistical analyses
e.g. small samples → deep learning ×; “large p small n” problem → variable selection by sparse modeling (lasso) ×
Approaches specific to bioinformatics are required
Purpose: multiview data analysis
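[Figure: a persons × features matrix $x_{ij}$ and a persons × shoppings matrix $x_{il}$ share the persons mode (example labels: features A, B, D, M; persons β, δ, μ; shoppings 1, 3, 4). Multiplying the two matrices turns them into a tensor: $x_{ijl} = x_{ij} \times x_{il}$.]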
Tensor decomposition
$$x_{ijl} = x_{ij} \times x_{il} \simeq \sum_{k_1,k_2,k_3} G(k_1,k_2,k_3)\, x_{ik_1}\, x_{jk_2}\, x_{lk_3}$$
i: persons, j: features, l: shoppings
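The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the decomposition is computed by higher-order SVD (HOSVD), which the slides do not state explicitly; the function names `product_tensor` and `hosvd` and the toy sizes are hypothetical.

```python
import numpy as np

def product_tensor(x_ij, x_il):
    """Build x_ijl = x_ij * x_il for two matrices sharing the first mode i."""
    return np.einsum("ij,il->ijl", x_ij, x_il)

def hosvd(tensor):
    """HOSVD of a three-mode tensor: core tensor G and one factor per mode."""
    factors = []
    for mode in range(tensor.ndim):
        # Unfold: bring `mode` to the front and flatten the remaining modes.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    # Core tensor: project every mode onto its singular vectors.
    core = np.einsum("ijl,ia,jb,lc->abc", tensor, *factors)
    return core, factors

rng = np.random.default_rng(0)
x_ij = rng.normal(size=(5, 4))   # toy sizes: 5 persons x 4 features
x_il = rng.normal(size=(5, 3))   # toy sizes: 5 persons x 3 shoppings
x_ijl = product_tensor(x_ij, x_il)

G, (x_ik1, x_jk2, x_lk3) = hosvd(x_ijl)

# Sum over k1, k2, k3 of G(k1,k2,k3) * x_ik1 * x_jk2 * x_lk3.
reconstructed = np.einsum("abc,ia,jb,lc->ijl", G, x_ik1, x_jk2, x_lk3)
print(np.allclose(reconstructed, x_ijl))
```

Because all singular vectors are kept here, the sum reproduces the tensor exactly; truncating to a few leading components gives the approximation written with ≃.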
Demonstration using synthetic data set
[Figure: synthetic data set. Two matrices with dimensions labelled 50 and 1000; figure labels: "+20% noise", "100% noise", "No correlations". Their product gives a 50 × 1000 × 1000 tensor.]
Tensor decomposition
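[Figure: vectors obtained by tensor decomposition. $x_{ik_1}$ for $k_1 = 1, 2, 3$ ($1 \le i \le 50$, persons); $x_{jk_2}$ for $k_2 = 1, 2$ ($1 \le j \le 1000$, features); $x_{lk_3}$ for $k_3 = 1, 2$ ($1 \le l \le 1000$, shoppings).]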
Advantages as multiview data analysis tools
・No weights are required to integrate multiple views
・Completely unsupervised learning (no model building from prior knowledge)
・Small computational cost because the method is linear
Disadvantages
・Tends to require more memory. Solution: summing over i, $\sum_i x_{ij} \times x_{il}$, gives a j × l matrix that can be converted back (explanation omitted).
・If the views share neither features nor samples, the product results in a four-mode tensor.
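The memory-saving workaround can be illustrated as follows. This is only a sketch: the slides omit how the result is "converted back", so the projection step that recovers sample-side vectors is an assumption, and all variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, n_shoppings = 6, 40, 30   # toy sizes
x_ij = rng.normal(size=(n_samples, n_features))
x_il = rng.normal(size=(n_samples, n_shoppings))

# Sum over the shared mode i: x_jl = sum_i x_ij * x_il, a J x L matrix.
# The three-mode tensor x_ijl is never materialized.
x_jl = x_ij.T @ x_il

# Ordinary SVD of the small matrix gives feature- and shopping-side vectors.
u, s, vt = np.linalg.svd(x_jl, full_matrices=False)
x_jk = u        # columns: feature-side singular vectors
x_lk = vt.T     # columns: shopping-side singular vectors

# ASSUMPTION: sample-side vectors are recovered by projecting each view
# onto its singular vectors (the slides omit this step).
x_ik_from_features = x_ij @ x_jk
x_ik_from_shoppings = x_il @ x_lk
print(x_jl.shape, x_ik_from_features.shape, x_ik_from_shoppings.shape)
```

The matrix product `x_ij.T @ x_il` is the same quantity as building the full tensor and summing it over i, but it only ever stores J × L numbers.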
Feature extraction
No real data are separated this well
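Assume Gaussian → P-values by χ² distribution:
$$P_i = P_{\chi^2}\left[ > \sum_k \left( \frac{x_{ik}}{\sigma_k} \right)^2 \right]$$
Detect outliers: Benjamini-Hochberg corrected P < 0.01
[Figure: histogram P(p) of the P-values, horizontal axis 1 − p from 0 to 1.]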
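The outlier-detection step can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: σ is taken to be the standard deviation of each component, the number of summed components is used as the degrees of freedom, and the helper names (`chi2_sf`, `benjamini_hochberg`, `select_outliers`) and the planted toy data are hypothetical.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper-tail probability P[chi2_df > x] for integer df >= 1.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    with a = df / 2 and z = x / 2.
    """
    z = x / 2.0
    if df % 2:                        # odd df: start from df = 1
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:                             # even df: start from df = 2
        a, q = 1.0, math.exp(-z)
    while 2 * a < df:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1)) if z > 0 else 0.0
        a += 1.0
    return min(q, 1.0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

def select_outliers(x_ik, threshold=0.01):
    """Indices i whose BH-adjusted chi-squared P-value is below threshold."""
    x_ik = np.asarray(x_ik, dtype=float)
    sigma_k = x_ik.std(axis=0)                    # ASSUMPTION: per-component SD
    stat = ((x_ik / sigma_k) ** 2).sum(axis=1)    # sum_k (x_ik / sigma_k)^2
    p = np.array([chi2_sf(s, x_ik.shape[1]) for s in stat])
    return np.flatnonzero(benjamini_hochberg(p) < threshold)

# Toy data (hypothetical): Gaussian noise with ten planted outliers.
rng = np.random.default_rng(2)
x_ik = rng.normal(size=(1000, 2))
planted = np.arange(10)
x_ik[planted] = 10.0
selected = select_outliers(x_ik)
print(selected)
```

Features that follow the Gaussian null get large adjusted P-values and are discarded; only those far out in the tails survive the 0.01 threshold.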
Applications: multi-omics data
[Figure: mRNA and miRNA profiles over samples 1 to 5, divided into group A and group B; figure labels: "active", "expression", "interaction".]
$x_{ij} \times x_{il}$; i: 161 samples, j: 13393 mRNAs, l: 755 miRNAs; 8 groups
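A quick calculation shows why memory matters at these sizes. The sketch below assumes 64-bit floating-point storage, which the slides do not specify, and compares the full tensor with the matrix obtained by summing over samples.

```python
n_samples, n_mrna, n_mirna = 161, 13393, 755
bytes_per_value = 8   # ASSUMPTION: 64-bit floating point

tensor_entries = n_samples * n_mrna * n_mirna   # full x_ijl
summed_entries = n_mrna * n_mirna               # sum_i x_ij * x_il

tensor_gb = tensor_entries * bytes_per_value / 1e9
summed_mb = summed_entries * bytes_per_value / 1e6
print(f"full tensor:   {tensor_entries:,} entries, about {tensor_gb:.1f} GB")
print(f"summed matrix: {summed_entries:,} entries, about {summed_mb:.1f} MB")
# full tensor:   1,627,986,115 entries, about 13.0 GB
# summed matrix: 10,111,715 entries, about 80.9 MB
```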
Selection of $x_{ik_1}$ distinct between symptoms
[Figure: panels for $k_1 = 1, 2, 3, 4, 5$, each with its P-value.]
$1 \le k_1 \le 5$ are symptom dependent
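[Table: core tensor $G(k_1,k_2,k_3)$ for $1 \le k_1, k_2, k_3 \le 5$, listed by $k_2$, $k_3$, $k_1$ and ordered from larger G to smaller G; $k_1$: sample, $k_2$: mRNA, $k_3$: miRNA.]
$1 \le k_2 \le 5$, $1 \le k_3 \le 2$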
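Ranking the core tensor entries can be sketched as follows, assuming that "larger G" refers to larger absolute value. The function name `top_core_entries` and the numbers placed in `G` are hypothetical and serve only to show the mechanics; indices are 0-based in the code.

```python
import numpy as np

def top_core_entries(G, k1, n_top=3):
    """(k2, k3, G value) triples with the largest |G[k1, k2, k3]| for fixed k1."""
    slab = np.abs(G[k1])
    flat_order = np.argsort(slab, axis=None)[::-1][:n_top]
    k2, k3 = np.unravel_index(flat_order, slab.shape)
    return [(int(a), int(b), float(G[k1, a, b])) for a, b in zip(k2, k3)]

# Hypothetical core tensor with a few dominant entries.
G = np.zeros((5, 5, 5))
G[0, 0, 0] = 9.0
G[0, 1, 0] = -7.5
G[0, 0, 1] = 4.0
G[0, 3, 4] = 0.2

print(top_core_entries(G, k1=0))
# [(0, 0, 9.0), (1, 0, -7.5), (0, 1, 4.0)]
```

The pairs at the top of this list indicate which mRNA-side and miRNA-side vectors are most strongly coupled to the chosen sample-side vector.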
$x_{jk_2}$, $x_{lk_3}$: assume Gaussian → P-values by χ² distribution → detect outliers with Benjamini-Hochberg corrected P < 0.01
7 miRNAs out of 755 miRNAs; 427 mRNAs out of 13393 mRNAs (biological validation omitted)
Summary
・As a feature selection method for multiview data, I propose applying tensor decomposition to a tensor generated as a product of matrices and then selecting the features with BH-corrected P-values < 0.01, computed from a χ² distribution assumed for a mode.
・For the synthetic data set, apparently uncorrelated variables embedded in noise are decomposed into the original orthogonal vectors after the correlated variables are identified.
・For the multi-omics data set, a small fraction (a few %) of intercorrelated and biologically reasonable miRNAs and mRNAs is identified among a huge number of mRNAs and miRNAs.
My presentation at GIW2017: GIW 7 RNA Bioinformatics, 2nd Nov., morning (ca. 10 AM), at Adonis (1F)
Tensor decomposition-based unsupervised feature extraction identified the universal nature of sequence-nonspecific off-target regulation of mRNA mediated by microRNA transfection