Medical Data Mining Carlos Ordonez University of Houston Department of Computer Science


Post on 05-Jan-2016


Medical Data Mining

Carlos Ordonez

University of Houston

Department of Computer Science


Outline

• Motivation

• Main data mining techniques:

– Constrained Association Rules

– OLAP Exploration and Analysis

• Other classical techniques:

– Linear Regression

– PCA

– Naïve Bayes

– K-Means

– Bayesian Classifier


Motivation: why inside a DBMS?

• DBMSs offer a level of security unavailable with flat files.

• Databases have built-in features that optimize extraction and simple analysis of datasets.

• We can increase the complexity of these analysis methods while still keeping the benefits offered by the DBMS.

• We can analyze large amounts of data in an efficient manner.


Our approach

• Avoid exporting data outside the DBMS

• Exploit SQL and UDFs

• Accelerate computations with query optimization and by pushing processing into main memory


Constrained Association Rules

• Association rules: a technique for identifying patterns in datasets, with rules ranked by support and confidence

• Look for relationships between variables

• Detect groups of items that frequently occur together in a given dataset

• Rules have the form X => Y

• The set of items X is often found in conjunction with the set of items Y
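To make the two metrics concrete, here is a minimal Python sketch of support and confidence for a rule X => Y. The slides describe an SQL-based implementation inside the DBMS; this is only an illustration, and the patient item sets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(Y | X), computed as support(X and Y) / support(X)."""
    s_x = support(transactions, antecedent)
    if s_x == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, set(antecedent) | set(consequent)) / s_x


# Hypothetical binary items per patient (not taken from the heart data set).
patients = [
    {"smoker", "high_chol", "disease"},
    {"smoker", "disease"},
    {"smoker", "high_chol", "disease"},
    {"high_chol"},
    {"smoker"},
]

# Rule: smoker => disease
rule_support = support(patients, {"smoker", "disease"})
rule_confidence = confidence(patients, {"smoker"}, {"disease"})
print(rule_support, rule_confidence)
```

A rule is typically kept only when both values exceed user-chosen thresholds.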


The Constraints

• Group Constraint

– Determines which variables can occur together in the final rules

• Item Constraint

– Determines which variables will be used in the study

– Allows the user to ignore some variables

• Antecedent / Consequent Constraint

– Determines the side of the rule that a variable can appear on
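The three constraints can be pictured as a filter applied to candidate rules. The sketch below is an assumption-laden illustration, not the authors' algorithm: it assumes that items sharing a group id may not appear together in one rule, and every item name, group id and side label is hypothetical.

```python
# All names and values below are hypothetical illustrations.
IGNORED = {"patient_id"}  # item constraint: variables excluded from the study

# Group constraint (assumed semantics): items with the same group id
# may not occur together in one rule.
GROUP = {"age_high": 1, "age_low": 1, "smoke": 2, "lad_disease": 3}

# Antecedent/consequent constraint: the side each item may appear on.
SIDE = {
    "age_high": "antecedent",
    "age_low": "antecedent",
    "smoke": "antecedent",
    "lad_disease": "consequent",
}


def satisfies_constraints(antecedent, consequent):
    """Return True if the rule antecedent => consequent passes all three checks."""
    items = list(antecedent) + list(consequent)
    # Item constraint: reject rules that use an ignored variable.
    if any(item in IGNORED for item in items):
        return False
    # Group constraint: no two items of the rule may share a group.
    groups = [GROUP[item] for item in items]
    if len(groups) != len(set(groups)):
        return False
    # Antecedent/consequent constraint: each item must sit on an allowed side.
    if any(SIDE[item] == "consequent" for item in antecedent):
        return False
    if any(SIDE[item] == "antecedent" for item in consequent):
        return False
    return True


print(satisfies_constraints(["age_high", "smoke"], ["lad_disease"]))
```

Applying such a filter while rules are generated, rather than afterwards, is what lets constraints cut both the number of patterns and the running time.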


Experiment

• Input dataset: p=25 attributes, n=655 records

• Three types of Attributes:

– P: perfusion measurements

– R: risk factor

– D: heart disease measurements


Experiments

• This table summarizes the impact of constraints on number of patterns and running time.


Experiments

• This figure shows groups of rules predicting no heart disease.


Experiments

• This figure shows groups of rules predicting heart disease.


Experiments

• These figures show some selected cover rules, predicting absence or existence of disease.


OLAP Exploration and Analysis

• Definition:

– Input table F with n records

– Cube dimensions: D={D1,D2,…,Dd}

– Measure dimensions: A={A1,A2,…,Ae}

– In OLAP processing, the basic idea is to compute aggregations on a measure Ai by subsets of dimensions G, G ⊆ D.
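The definition can be sketched in a few lines of Python: for every subset G of the dimensions, group the records by G and aggregate the measure. In the DBMS this corresponds to GROUP BY queries; the function name, the choice of the average as the aggregation, and the four-row table below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube_averages(rows, dimensions, measure):
    """Average `measure` within every group of every subset G of `dimensions`."""
    result = {}
    for size in range(len(dimensions) + 1):
        for G in combinations(dimensions, size):
            sums = defaultdict(lambda: [0.0, 0])  # group key -> [total, count]
            for row in rows:
                key = tuple(row[dim] for dim in G)
                sums[key][0] += row[measure]
                sums[key][1] += 1
            result[G] = {key: total / count for key, (total, count) in sums.items()}
    return result


# Hypothetical table F: two dimensions (SEX, SMOKE) and one measure (LAD).
F = [
    {"SEX": "M", "SMOKE": 1, "LAD": 80},
    {"SEX": "M", "SMOKE": 0, "LAD": 40},
    {"SEX": "F", "SMOKE": 1, "LAD": 60},
    {"SEX": "F", "SMOKE": 0, "LAD": 20},
]

cube = cube_averages(F, ["SEX", "SMOKE"], "LAD")
for G, groups in cube.items():
    print(G, groups)
```

With d dimensions there are 2^d subsets G, which is why the exploration benefits from running close to the data.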


OLAP Exploration and Analysis

• Example:

– Cube with three dimensions (D1,D2,D3)

– Each face represents a subcube on two dimensions

– Each cell represents a subcube on one dimension


OLAP Statistical Tests

• We proposed the use of statistical tests on pairs of OLAP subcubes to analyze their relationship

• Statistical tests allow us to show mathematically that a pair of subcubes is significantly different


OLAP Statistical Tests

• The null hypothesis H0 states μ1 = μ2, and the goal is to find groups where H0 can be rejected with high confidence 1-p.

• The so-called alternative hypothesis H1 states μ1 ≠ μ2.

• We use a two-tailed test, which detects a significant difference on either tail of the Gaussian distribution, so the means can be compared in any order (μ1 < μ2 or μ2 < μ1).

• The test relies on the following equation to compute a random variable z.

$$z = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$
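As a rough illustration of the test, the sketch below computes z for two groups and the matching two-tailed p-value under a standard Gaussian. It assumes that each group is summarized by its mean, variance and size; the function name and the numbers are hypothetical and not taken from the heart data set.

```python
import math


def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """Return (z, two-tailed p-value) for comparing the means of two groups."""
    z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    # Two-tailed p-value: P(|Z| >= |z|) for a standard Gaussian Z.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical summary statistics of one measure in two subcubes.
z, p = two_sample_z(mean1=60.0, var1=400.0, n1=100, mean2=50.0, var2=500.0, n2=100)
print(z, p)
print("reject H0 at p=0.01" if p < 0.01 else "cannot reject H0 at p=0.01")
```

Swapping the two groups flips the sign of z but leaves the p-value unchanged, which is the point of using a two-tailed test.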


Experiments

• n = 655

• d = 21

• e = 4

• Includes patient information, habits, and perfusion measurements as dimensions

• Measures are the stenosis, or amount of narrowing, of the four main arteries of the human heart


Experiment Evaluation

• Heart data set: Group pairs with significant measure differences at p=0.01


Experiment Evaluation

• Summary of medical results at p=0.01

• The most important dimensions are OLDYN, SEX and SMOKE.


Comparing Reliability of OLAP Statistical Tests and Association Rules

• Both techniques were altered to bring them onto the same plane for comparison

– Association Rules: added post-process pairing

– OLAP Statistical Tests: added constraints

• Cases under study

– Association Rules (HH): both rules have high confidence

• AdmissionAfterOpen(1), AorticDiagnosis(0/1) => NetMargin(0/1)

• High confidence, but also high p-value

• Data is crowded around the AR boundary point


Comparing Reliability of OLAP Statistical Tests and Association Rules

• Association Rules: High/High

– We can see that the data is crowded around boundary point for Association Rules

– Two Gaussians are not significantly different

– Conclude: both agree, OLAP Statistical Tests is more reliable


Comparing Reliability of OLAP Statistical Tests and Association Rules

• Association Rules: Low/Low

– Once again boundary point comes into play

– Two Gaussians are not significantly different

– Conclude: both agree


Comparing Reliability of OLAP Statistical Tests and Association Rules

• Association Rules: High/Low

– Ambiguous


Results from TMHS dataset

• Mainly a financial dataset

– Revolves around the opening of a new medical center for treating heart patients

• Results from Association Rules

– Found 4051 rules with confidence>=0.7 and support>=5%

– AfterOpen=1, Elder=1 => Low Charges

• After the center opened, the elderly enjoyed low charges

– AfterOpen=0, Elder=1 => High Charges

• Before the center opened, the elderly were associated with high charges

• Results from OLAP Statistical Tests

– Found 1761 pairs with p-value<0.01 and support>=5%

– Walk-in, insurance (commercial/medicare) => charges (high/low)

• The total charges to a patient depend on his/her insurance when the admission source is a walk-in

– AorticDiagnosis=0, AdmissionSource (Walk-in / Transfer) => lengthOfStay (low / high)

• If the diagnosis is not aortic disease, the length of stay depends on how the patient was admitted.


Machine Learning techniques

• PCA

• Regression: Linear and Logistic

• Naïve Bayes

• Bayesian classification


Principal Component Analysis

• Dimensionality reduction technique for high-dimensional data (e.g. microarray data).

• Exploratory data analysis, by finding hidden relationships between attributes.

Assumptions:

– Linearity of the data.

– Statistical importance of mean and covariance.

– Large variances have important dynamics.


Principal Component Analysis

• Rotation of the input space to eliminate redundancy.

• Most variance is preserved.

• Minimal correlation between attributes.

• U^T X is the new rotated space.

• Select the k most representative components of U (k<d).

• Solving PCA is equivalent to solving the SVD, which is related to the eigen-problem:

X = U E V^T

X X^T = U E^2 U^T

U: the left singular vectors (eigenvectors of X X^T)

E: diagonal matrix of singular values (E^2 holds the eigenvalues of X X^T)

V: the right singular vectors (eigenvectors of X^T X)
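The link between the SVD and the eigen-problem follows in one line. This short derivation assumes the standard SVD properties, namely that V has orthonormal columns and E is diagonal:

```latex
\begin{aligned}
X X^T &= (U E V^T)(U E V^T)^T \\
      &= U E \,(V^T V)\, E^T U^T \\
      &= U E^2 U^T \qquad \text{since } V^T V = I \text{ and } E^T = E .
\end{aligned}
```

So the columns of U are eigenvectors of X X^T, and the squared diagonal entries of E are the corresponding eigenvalues. If X is mean-centered (an assumption not stated on the slide), X X^T is proportional to the covariance matrix, which is why the leading columns of U capture the most variance.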


PCA Example

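| Attribute | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| age | 0.393 | 0.223 | | | -0.259 | 0.195 | -0.405 | |
| gender | -0.293 | 0.454 | -0.413 | | | | | |
| on_thyroxine | -0.161 | | 0.232 | 0.608 | -0.226 | -0.100 | 0.162 | |
| query_thyroxine | 0.229 | -0.100 | -0.397 | | | 0.446 | 0.184 | |
| on_antithyroid_med | 0.107 | 0.221 | -0.175 | 0.327 | 0.447 | | -0.204 | |
| sick | 0.171 | | 0.019 | | | 0.131 | 0.208 | -0.846 |
| pregnant | | 0.138 | | | | -0.188 | | -0.194 |
| surgery | | | | | 0.246 | | | -0.276 |
| I131_treatment | -0.108 | -0.214 | 0.107 | | 0.329 | 0.360 | -0.059 | |
| query_hypothyroid | | -0.157 | 0.136 | | 0.294 | -0.573 | -0.123 | -0.251 |
| query_hyperthyroid | -0.223 | 0.107 | | | -0.129 | 0.189 | | |
| lithium | -0.134 | 0.159 | 0.421 | 0.217 | 0.247 | 0.319 | 0.216 | |
| goitre | -0.100 | -0.174 | 0.166 | -0.430 | 0.236 | 0.278 | -0.178 | |
| tumor | 0.384 | -0.151 | -0.108 | | 0.109 | -0.110 | 0.697 | 0.155 |
| hypopituitary | | -0.156 | 0.195 | -0.230 | -0.523 | | | -0.155 |
| psych | 0.118 | -0.604 | | 0.459 | | | -0.276 | |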


PCA Example

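| Attribute | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| age | 0.102 | | | | | | | |
| chol | 0.131 | 0.175 | 0.198 | 0.156 | | | 0.105 | 0.275 |
| claudi | 0.173 | 0.273 | 0.252 | 0.220 | | -0.408 | 0.275 | 0.194 |
| diab | -0.273 | | 0.261 | 0.305 | 0.353 | | 0.217 | |
| fhcad | | 0.144 | -0.420 | 0.266 | -0.193 | -0.239 | 0.326 | |
| gender | -0.409 | 0.347 | -0.106 | | 0.379 | | 0.108 | 0.393 |
| hta | -0.128 | -0.122 | 0.138 | | 0.109 | -0.152 | -0.105 | -0.110 |
| hyplpd | 0.217 | | 0.183 | 0.195 | -0.204 | | | -0.154 |
| pangio | -0.103 | -0.111 | | -0.347 | 0.224 | -0.311 | -0.415 | |
| pcarsur | | 0.286 | | -0.318 | | -0.117 | -0.217 | 0.370 |
| pstroke | | 0.449 | 0.138 | -0.157 | -0.448 | 0.263 | | 0.152 |
| smoke | | 0.159 | -0.323 | 0.417 | | 0.123 | -0.464 | |
| lad | 0.371 | 0.504 | | -0.103 | 0.342 | -0.170 | -0.160 | -0.516 |
| lcx | 0.572 | -0.135 | | | 0.448 | 0.422 | 0.160 | 0.205 |
| lm | -0.288 | 0.221 | 0.301 | | | 0.294 | 0.301 | -0.313 |
| rca | 0.184 | -0.156 | | -0.329 | -0.141 | -0.409 | 0.210 | 0.142 |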


Linear Regression

• There are two main applications for linear regression. The first is prediction or forecasting of the output, or variable of interest, Y:

– Fit a model from the observed Y and the input variables X.

– For values of X given without an accompanying value of Y, the model can be used to predict the output of interest Y.

• Given an input data set X={x1,x2,…,xn} with d dimensions Xa, and the response or variable of interest Y, linear regression finds a set of coefficients β to model:

Y = β0+β1X1+…+βdXd+ɛ.
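For illustration, the coefficients β can be estimated by ordinary least squares. The sketch below solves the normal equations with Gauss-Jordan elimination in plain Python; it is not the in-DBMS implementation discussed in these slides, the function names are hypothetical, and the toy data are generated from Y = 1 + 2·X1 − 3·X2 with no noise.

```python
def fit_linear_regression(X, y):
    """Least-squares coefficients [b0, b1, ..., bd] via the normal equations.

    Assumes the columns of X are linearly independent; otherwise a pivot
    becomes zero and the division below fails.
    """
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept column
    m = len(rows[0])
    # Augmented matrix [Z^T Z | Z^T y], where Z is X with the intercept column.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]


def predict(beta, x):
    """Evaluate b0 + b1*x1 + ... + bd*xd for one input point x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Toy data: y = 1 + 2*x1 - 3*x2 (hypothetical, noise-free).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]

beta = fit_linear_regression(X, y)
print(beta)
print(predict(beta, [3, 2]))
```

The sums that build Z^T Z and Z^T y need only one pass over the data, which is the kind of computation that maps naturally onto SQL aggregations.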


Linear Regression with SSVS

• Bayesian variable selection:

– Quantify the strength of the relationship between Y and a number of explanatory variables Xa.

– Assess which Xa may have no relevant relationship with Y.

– Identify which subsets of the Xa contain redundant information about Y.

• The goal is to find the subset of explanatory variables Xγ which best predicts the output Y, with the regression model Y = βγ Xγ+ɛ.

• We use Gibbs sampling, which is an MCMC algorithm, to estimate the probability distribution π(γ|Y,X) of a model to fit the output variable Y.

• Other techniques, like stepwise variable selection, perform only a partial search for the model that best explains the output variable.

• Stochastic Search Variable Selection (SSVS) finds the most likely subsets of variables based on posterior probabilities.


Linear Regression in the DBMS

• Bayesian variable selection is implemented completely inside the DBMS with SQL and UDFs for efficient use of memory and processor resources.

• Our algorithms and the storage layouts chosen for tables in the DBMS have a significant impact on execution performance.

• Compared to the statistical package R, our implementations scale to large data sets.


Linear regression: Experimental results

Parameters: Variables: 21, n = 655, Y: rca, c = 100, it = 10000, burn = 1000

| Gamma | Prob | rSquared |
| --- | --- | --- |
| 0,1,3,8,12,13,16,19 | 0.012333 | 0.826227 |
| 0,1,3,8,12,13 | 0.011778 | 0.838421 |
| 0,1,3,6,8,12,13 | 0.011556 | 0.832125 |
| 0,1,3,6,8,12,13,17 | 0.010333 | 0.826885 |
| 0,1,3,8,9,12,13,16,19 | 0.008889 | 0.821647 |
| 0,1,3,6,8,9,12,13 | 0.008 | 0.826993 |
| 0,1,3,8,12,13,17 | 0.007222 | 0.833006 |
| 0,1,3,6,8,13,17 | 0.006889 | 0.833852 |
| 0,1,3,6,8,9,13 | 0.006778 | 0.838573 |
| 0,1,3,6,8,9,12,13,17 | 0.006556 | 0.821839 |

Parameters: Variables: 21, n = 655, Y: lad, c = 100, it = 10000, burn = 1000

| Gamma | Prob | rSquared |
| --- | --- | --- |
| 0,1,14,18 | 0.061556 | 0.768594 |
| 0,1,13,14,18 | 0.028556 | 0.7652 |
| 0,1,8,14,18 | 0.022889 | 0.765396 |
| 0,1,9,14,18 | 0.014444 | 0.766478 |
| 0,1,6,14,18 | 0.013222 | 0.766782 |
| 0,1,3,14,18 | 0.011667 | 0.767118 |
| 0,1,14,16,18 | 0.010111 | 0.767645 |
| 0,1,14,17,18 | 0.01 | 0.767105 |
| 0,1,14,18,21 | 0.008667 | 0.768276 |
| 0,1,8,13,14,18 | 0.008333 | 0.762457 |

| Variable | Gamma |
| --- | --- |
| age | 1 |
| chol | 2 |
| claudi | 3 |
| diab | 4 |
| fhcad | 5 |
| gender | 6 |
| hta | 7 |
| hyplpd | 8 |
| pangio | 9 |
| pcarsur | 10 |
| pstroke | 11 |
| smoke | 12 |
| il | 13 |
| ap | 14 |
| al | 15 |
| la | 16 |
| as_ | 17 |
| sa | 18 |
| li | 19 |
| si | 20 |
| is_ | 21 |


Linear regression: Experimental results

| Gamma | Probability | rSquared |
| --- | --- | --- |
| 0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3720,3924,4854,4879 | 0.761239 | 0.00664 |
| 0,3,4,52,99,196,287,1833,1857,2563,2601,3924,4854,4879 | 0.108891 | 0.006756 |
| 0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3924,4854,4879 | 0.050949 | 0.006702 |
| 0,3,4,52,99,196,287,1833,3924,4854,4879 | 0.041958 | 0.006771 |
| 0,3,4,52,99,196,287,1833,2563,2601,3924,4854,4879 | 0.027972 | 0.006758 |
| 0,3,4,52,99,196,287,1833,4854 | 0.002997 | 0.006836 |
| 0,3,4,52,99,196,287,1833,4854,4879 | 0.001998 | 0.006776 |
| 0,3,4,52,99,196,287,1833,2601,3924,4854,4879 | 0.001998 | 0.006758 |
| 0,3,4,99,196,287,1833,4854 | 0.000999 | 0.006924 |

| Parameter | Value |
| --- | --- |
| d(γ0) | 1 |
| dimensions | 4918 |
| n | 295 |
| iterations | 1000 |
| c | 1 |
| y | Cens |

Cancer microarray data, where the gamma entries are gene numbers.


Logistic Regression

Similar to linear regression, but the data is fitted to a logistic curve. This technique is used to predict the probability that an event occurs.

P(Y=1|x) = π(x)

π(x) = 1/(1+e^{-g(x)}), where g(x) = β0+β1X1+β2X2+…+βdXd
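The model can be evaluated with a few lines of Python. This sketch only scores a point with given coefficients; it does not fit them, and the coefficient values and function name are hypothetical.

```python
import math


def logistic_probability(beta, x):
    """P(Y=1 | x) = 1 / (1 + exp(-g(x))) with g(x) = b0 + b1*x1 + ... + bd*xd.

    Not numerically hardened: a very negative g(x) can overflow math.exp.
    """
    g = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-g))


# Hypothetical coefficients [b0, b1, b2] and one input point.
beta = [-1.0, 0.5, 2.0]
p = logistic_probability(beta, [2.0, 0.0])  # g(x) = -1 + 0.5*2 + 2*0 = 0
predicted_class = 1 if p >= 0.5 else 0
print(p, predicted_class)
```

A point is usually assigned to class 1 when π(x) is at least 0.5, that is, when g(x) is non-negative.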


Logistic Regression: Experimental results

| Name | Coefficient | Name | Coefficient |
| --- | --- | --- | --- |
| Intercept | -2.191237293 | LI | -0.090759713 |
| AGE | 0.035740648 | LA | -0.210152957 |
| SEX | 0.40150077 | AP | 0.600745945 |
| HTA | 0.279865571 | AS_ | 0.264413463 |
| DIAB | 0.060630279 | SA | 0.342609744 |
| CHOL | 0.001882748 | SI | 0.04750216 |
| SMOKE | 0.31437235 | IS_ | -0.159692182 |
| AL | 0.198138067 | IL | 0.446180853 |

Model: med655

Train

• n = 491

• d = 15

• y = LAD>=70%

Test

• n = 164

Accuracy

| Model | Global | Class-0 | Class-1 |
| --- | --- | --- | --- |
| med655 | 70 | 74 | 67 |


Naïve Bayes (NB)

• Naïve Bayes is one of the most popular classifiers.

• Easy to understand.

• Produces a simple model structure.

• It is robust and has a solid mathematical background.

• Can be computed incrementally.

• Classification is achieved in linear time.

• However, it has an independence assumption.
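The incremental and linear-time properties can be seen in a small Gaussian Naïve Bayes sketch. It assumes numeric attributes modeled with one Gaussian per class and dimension; the class name, the running-sum names n, L and Q, and the toy points are choices made for this illustration only.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Keeps, per class, a count n and per-dimension sums L (x) and Q (x squared)."""

    def __init__(self, d):
        self.d = d
        self.n = defaultdict(int)
        self.L = defaultdict(lambda: [0.0] * d)
        self.Q = defaultdict(lambda: [0.0] * d)

    def update(self, x, g):
        """Fold one labelled point into the running sums (the incremental step)."""
        self.n[g] += 1
        for a in range(self.d):
            self.L[g][a] += x[a]
            self.Q[g][a] += x[a] * x[a]

    def predict(self, x, min_var=1e-9):
        """Return the class with the highest log-posterior under independence."""
        total = sum(self.n.values())
        best, best_score = None, -math.inf
        for g, n in self.n.items():
            score = math.log(n / total)  # log of the class prior
            for a in range(self.d):
                mean = self.L[g][a] / n
                var = max(self.Q[g][a] / n - mean * mean, min_var)
                # Independence assumption: per-dimension log-densities are summed.
                score += -0.5 * math.log(2 * math.pi * var) - (x[a] - mean) ** 2 / (2 * var)
            if score > best_score:
                best, best_score = g, score
        return best


# Hypothetical two-dimensional training points for classes 0 and 1.
train = [([1.0, 1.2], 0), ([0.8, 1.0], 0), ([1.2, 0.9], 0),
         ([5.0, 5.1], 1), ([4.8, 5.3], 1), ([5.2, 4.9], 1)]

model = GaussianNaiveBayes(d=2)
for x, g in train:
    model.update(x, g)

print(model.predict([1.1, 1.0]), model.predict([5.1, 5.0]))
```

Because the model is fully described by n, L and Q, it can be built in one pass over the table and updated as new records arrive.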


Bayesian Classifier

• Why Bayesian:

– A Bayesian classifier based on class decomposition using EM clustering.

– Robust models with good accuracy and low over-fit.

– Classifier adapted to skewed distributions and overlapping sets of data points by building local models based on clusters.

– EM algorithm used to fit the mixtures per class.

– The Bayesian classifier is composed of a mixture of k distributions or clusters per class.


Bayesian Classifier Based on K-Means (BKM)

• Motivation

– Bayesian Classifiers are accurate and efficient.

– A Generalization of the Naïve Bayes algorithm.

– Model accuracy can be tuned by varying the number of clusters, setting class priors and making a probability-based decision.

– K-means is a distance-based clustering algorithm.

– Two phases are involved:

• Building the predictive model.

• Scoring a new data set based on the computed predictive model.
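The two phases can be sketched as follows. This is a deliberately simplified stand-in, not the BKM classifier itself: it runs k-means separately on each class and then scores a point by its nearest centroid, whereas the slides describe a probability-based decision with class priors. All function names and data points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; centroids start at the first k points (a simplification).

    Assumes the class has at least k points.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_model(points, labels, k):
    """Phase 1: decompose each class into k clusters."""
    by_class = {}
    for p, g in zip(points, labels):
        by_class.setdefault(g, []).append(p)
    return {g: kmeans(members, k) for g, members in by_class.items()}


def score(model, x):
    """Phase 2: assign x to the class that owns the nearest centroid."""
    return min(model, key=lambda g: min(sq_dist(x, c) for c in model[g]))


# Hypothetical data: each class is made of two well-separated clusters.
points = [[0, 0], [10, 10], [0, 1], [10, 9], [5, 5], [5, -5], [5, 6], [5, -4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = build_model(points, labels, k=2)
print(score(model, [9, 9]), score(model, [5, 4]))
```

Decomposing each class into several clusters lets the classifier follow skewed or multi-modal class distributions that a single mean per class would miss.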


Example

• A medical data set with n = 655 rows is used, with a varying number of clusters k.

• The data set has d = 25 dimensions, which include the diseases to be predicted, risk factors and perfusion measurements.

• Null values were replaced with the mean of their dimension.

• We measure prediction accuracy for two diseases, LAD and RCA.

• Accuracy peaks at k = 8.


Example: medical

med655

• n = 655

• d = 15

• g = 0,1

• G represents whether the patient developed heart disease or not.

wbcancer

• n = 569

• d = 7

• g = 0,1

• G represents whether the cancer is benign or malignant.

• Features describe the characteristics of cell nuclei obtained from an image of a breast mass.

Accuracy

| Data set | Classifier | Global | Class-0 | Class-1 |
| --- | --- | --- | --- | --- |
| med655 | NB | 67 | 83 | 53 |
| med655 | BKM | 62 | 53 | 70 |
| wbcancer | NB | 93 | 91 | 95 |
| wbcancer | BKM | 93 | 84 | 97 |


BKM & NB Models

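NB: med655

| g | MEAN_VAR | AGE | SEX | HTA | CHOL | SMOKE |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | MEAN | 58.6 | 0.64 | 0.4 | 219.47 | 0.57 |
| 0 | VAR | 147.92 | 0.23 | 0.24 | 1497.45 | 0.25 |
| 1 | MEAN | 63.9 | 0.74 | 0.45 | 218.34 | 0.62 |
| 1 | VAR | 128.5 | 0.19 | 0.25 | 957.69 | 0.24 |

NB: wbcancer

| g | MEAN_VAR | x3 | x5 | x12 | x18 | x26 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | MEAN | 115.71 | 0.1 | 1.2 | 0.02 | 0.37 |
| 0 | VAR | 438.72 | 0 | 0.26 | 3.35E-05 | 0.03 |
| 1 | MEAN | 78.18 | 0.09 | 1.22 | 0.01 | 0.18 |
| 1 | VAR | 136.45 | 0 | 0.37 | 3.58E-05 | 0.01 |

BKM: med655

| g | j | AGE | SEX | HTA | CHOL | SMOKE |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 4.49 | 0 | 0.97 | 5.3 | 1.82 |
| 0 | 2 | 4.36 | 2.08 | 1.07 | 5.49 | 0.48 |
| 0 | 3 | 5.09 | 0.08 | 1.25 | 6.35 | 0.21 |
| 0 | 4 | 5.1 | 2.08 | 0.37 | 5.59 | 1.78 |
| 1 | 1 | 6.28 | 1.75 | 0.96 | 6.97 | 2.06 |
| 1 | 2 | 6.45 | 1.31 | 0.74 | 6.98 | 0 |
| 1 | 3 | 4.64 | 1.82 | 0.88 | 7.24 | 2.06 |
| 1 | 4 | 4.7 | 1.75 | 1.03 | 7.04 | 0 |

BKM: wbcancer

| g | j | x3 | x5 | x12 | x18 | x26 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 6.56 | 8.27 | 2.1 | 2.97 | 2.8 |
| 0 | 2 | 5.44 | 7.32 | 2.02 | 2.07 | 1.63 |
| 0 | 3 | 4.68 | 8.94 | 2.18 | 2.46 | 3.12 |
| 0 | 4 | 5.42 | 8.37 | 4.18 | 3.89 | 1.79 |
| 1 | 1 | 6.29 | 6.12 | 2.12 | 0.96 | 1.06 |
| 1 | 2 | 6.97 | 7.12 | 2.16 | 3.07 | 3.59 |
| 1 | 3 | 5.92 | 7.83 | 2.45 | 1.9 | 1.74 |
| 1 | 4 | 7.49 | 6.68 | 1.48 | 1.49 | 2.02 |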


Cluster Means and Weights

• Means are assigned around the global mean based on Gaussian initialization.

• The table below shows the means of clusters having 9 dimensions (d).

• The weight of a cluster is given by 1.0/k, where k is the number of clusters.

| Class | AGE | SEX | DIAB | HYPLPD | FHCAD | SMOKE | CHOL | LA | AP | Weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 60 | 0.721 | 0.209 | 0.209 | 0.116 | 0.698 | 185 | -0.178 | -0.331 | 0.0754 |
| 0 | 76.5 | 0.632 | 0.08 | 0.488 | 0.056 | 0.488 | 223 | -0.225 | -0.37 | 0.219 |
| 0 | 42.2 | 0.754 | 0.029 | 0.667 | 0.261 | 0.58 | 224 | -0.505 | -0.715 | 0.121 |
| 0 | 65.1 | 0.753 | 0.193 | 0.602 | 0.0904 | 0.566 | 223 | -0.22 | -0.375 | 0.291 |
| 0 | 56.5 | 0.652 | 0.261 | 0.217 | 0.261 | 0.565 | 139 | -0.379 | -0.527 | 0.0404 |
| 0 | 54.2 | 0.729 | 0.132 | 0.583 | 0.104 | 0.66 | 223 | -0.26 | -0.519 | 0.253 |
| 1 | 51.9 | 0.533 | 0.2 | 0.933 | 0.267 | 0.733 | 269 | 0.0233 | -0.577 | 0.176 |
| 1 | 59.7 | 0.333 | 0.333 | 0.889 | 0 | 0.667 | 318 | -0.494 | -0.748 | 0.212 |
| 1 | 48 | 0.4 | 0.2 | 0.8 | 0.2 | 0.8 | 201 | -0.68 | -0.462 | 0.0588 |
| 1 | 67.1 | 0.444 | 0.222 | 0.889 | 0.111 | 0.593 | 252 | -0.474 | -0.645 | 0.318 |
| 1 | 53 | 0.5 | 0 | 1 | 0.5 | 0.75 | 456 | -0.512 | -1 | 0.0471 |
| 1 | 72.7 | 0.75 | 0.313 | 0.438 | 0 | 0.625 | 202 | -0.782 | -0.229 | 0.188 |


Prediction of Accuracy Varying k

(Same Clusters k per Class)

Dimensions = 21 (Perfusion Measurements + Risk factors)

| k | Accuracy for LAD | Accuracy for RCA |
| --- | --- | --- |
| 2 | 65.8% | 66.5% |
| 4 | 67.90% | 68.82% |
| 6 | 69.89% | 70.42% |
| 8 | 75.11% | 72.67% |
| 10 | 68.35% | 70.23% |

Dimensions = 9 (Perfusion Measurements)

| k | Accuracy for LAD | Accuracy for RCA |
| --- | --- | --- |
| 2 | 73.13% | 67.63% |
| 4 | 73.37% | 67.90% |
| 6 | 74.80% | 69.80% |
| 8 | 77.07% | 72.06% |
| 10 | 72.34% | 68.93% |


The DBMS Group

• Students:

– Zhibo Chen

– Carlos Garcia-Alvarado

– Mario Navas

– Sasi Kumar Pitchaimalai

– Ahmad Qwasmeh

– Rengan Xu

– Manish Limaye


Publications

1. Ordonez C., Chen Z., Evaluating Statistical Tests on OLAP Cubes to Compare Degree of Disease, IEEE Transactions on Information Technology in Biomedicine 13(5): 756-765 (2009)

2. Chen Z., Ordonez C., Zhao K., Comparing Reliability of Association Rules and OLAP Statistical Tests. ICDM Workshops 2008: 8-17

3. Ordonez, C., Zhao, K., A Comparison between Association Rules and Decision Trees to Predict Multiple Target Attributes, Intelligent Data Analysis (IDA), to appear in 2011.

4. Navas, M., Ordonez, C., Baladandayuthapani, V., On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs, IEEE ICDM Conference, 2010

5. Navas, M., Ordonez, C., Baladandayuthapani, V., Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs, Proc. ACM KDD Workshop on Large-scale Data Mining: Theory and Applications (LDMTA), 2010 

6. Ordonez C., Pitchaimalai, S.K. Bayesian Classifiers Programmed in SQL, IEEE Transactions on Knowledge and Data Engineering (TKDE) 22(1): 139-144 (2010)

7. Pitchaimalai, S.K., Ordonez, C., Garcia-Alvarado, C., Comparing SQL and MapReduce to compute Naive Bayes in a Single Table Scan, Proc. ACM CIKM Workshop on Cloud Data Management (CloudDB), 2010

8. Navas M., Ordonez C., Efficient computation of PCA with SVD in SQL. KDD Workshop on Data Mining using Matrices and Tensors 2009
