distributed cur decomposition for...
Post on 07-Aug-2020
1 Views
Preview:
TRANSCRIPT
DistributedCURDecompositionforBi-Clustering
StephenKline,KevinShawJune1,2016
{sakline,keshaw}@stanford.eduStanfordUniversity,CME323FinalProject
CURasalternativetoSVD– e.g.Biclustering
CompleteratingsforselectUsers
CompleteratingsforselectMovies
Movie Ratingssparse /huge
SVD– accuratebutheavy,Less interpretable(rotatedspace)
CUR– less accuratebutlight,Moreinterpretable*
*Asarchetypal usersandmovies
1
Movies
Users
User-Movie Biclusters
Biclustering wasoriginally developedinthecontextofDNAmicroarrays
Biclustering alsohaspotential inotherareasandhas addedinterpretability
Source:SourceCodeforBiologyMedicine(April2013)- "Thenon-negativematrixfactorizationtoolboxforbiologicaldatamining"
ReviewofSVD:A=U∑VT
• PRO- Highaccuracyo ksingular values/vectors produce thebestk-rank
approximation toA
• CON - Highcomputation/spacerequirementso Inour biclustering applicationwithMovieLens
data,thedistributed SVDis “roughlysquare” -ARPACK (vs.“tallandskinny” – ATAtrick)
dense /fulldense /full
Asparse /huge
densebig
dense /bigsparse/small
2
BackgroundonA=CUR
• CURtradesaccuracy…... forcomputation/spacesavings
• C/R =cols/rows fromA• U =pseudo-inverse ofW
(intersectionof CandR)
C
R
sparsebig
sparse /bigdense/small
Asparse /huge
Pinv( ) Intersectionof CandR(callitW,verysmall)
• Col/RowSelect() algsamplesw/replacement (allowsduplicates)
• Pinv(W) calculatedviaSVD(W)
• Accuracybetterforlargedatasets
3
DesignDecisionsforDistributedCUR
C
R
sparsebig
sparse /bigdense/small
Asparse /huge
KeyDesignDecision:Distributetwoinstances ofAavoidingfutureall-to-allcommunications
Only necessary tostoreCandRassetofindices intoACompute Ulocally
Therearemultiple variations ofCUR.Weselected thealgorithmaspresented in:Drineas, et.al.,2006."FastMonteCarloAlgorithms forMatricesIII"which(forexample)does notremoveduplicatecols/rows assomeothers do.
4
Serialvs.DistributedCUR- Asymptotics
• BuildCandR:o Generateprobabilities– O(mn)o CreateCmatrix– O(mk)o CreateRmatrix– O(nk)
• ConstructU• ComputeCTC– O(mk2)• SVDofCTC– O(k3)• ComputeAandB– O(k3)• U=ABT – O(k3)
5
• BuildCandR:• Generateprobabilities– O(mn +p)cost,O(maxdense)time
o Create2RDDsbyRow/Colpartition– O(mn)cost,AtoAo Bothinstances:reducetoRow/Colsums—
O(maxdense)time,nocommunicationo Oneinstance:reduceRowsumtototal– O(p)cost,O(logp)timeo Broadcasttotaltocalculateprobs – O(p)cost,O(logp)time
• CreateC/Rmatriceso Locallysamplekrows/cols– O(k)o BroadcastsampletoRDDs– O(pk)cost,O(klogp)time
• ConstructU• SameasSerial(lessopportunitytodistribute)
Serial Distributed (communication cost andcomputation time)
Biclustering:DistributedCURvsSVD- Empirics
6
top related