spanish inquisition

Spanish Inquisition: Final Project Week 2 - 4/29/09. Breast Cancer Gene Expression Data. Leon Kay, Yan Tran, Chris Thomas.

Upload: hollie

Post on 05-Jan-2016

Page 1: Spanish Inquisition

Spanish Inquisition

Final Project Week 2 - 4/29/09

Breast Cancer Gene Expression Data

Leon Kay, Yan Tran, Chris Thomas

Chris

Yan

Leon

Page 2: Spanish Inquisition

Weka Filtering

• Used CFS with BestFirst search

• Reduced the number of attributes from 1544 to 125

• CFS stands for Correlation-based Feature Selection. Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
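Hall [1] turns this hypothesis into a merit score for a subset of k features: merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below only evaluates that formula. The function name and the example correlations are made up for illustration, and how the correlations themselves are measured is left out.

```python
import math


def cfs_merit(mean_feature_class_corr, mean_feature_feature_corr, k):
    """Merit of a k-feature subset, following the formula in Hall [1]."""
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Made-up correlations: same predictive power, different redundancy.
low_redundancy = cfs_merit(0.3, 0.1, k=10)
high_redundancy = cfs_merit(0.3, 0.8, k=10)
print(low_redundancy > high_redundancy)
```

With the feature-class correlation held fixed, the subset whose features are less correlated with each other gets the higher merit, which is the behaviour the quoted hypothesis asks for.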

Page 3: Spanish Inquisition

CFS Algorithm - Searching

• Any search algorithm can be plugged into CFS. The author describes three: forward selection, backward elimination, and best first. All three are essentially greedy heuristic searches, which makes generating the feature subset far cheaper than evaluating every possible subset.

• “Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
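The forward variant of the search described above can be sketched as follows. This is a minimal illustration, not Weka's implementation: the `evaluate` callback, the `max_stale` parameter name, and the toy scoring function are all assumptions made for the example. In the project's setting, `evaluate` would be the CFS merit of a subset.

```python
import heapq
from itertools import count


def best_first_forward(n_features, evaluate, max_stale=5):
    """Forward best-first search over feature subsets.

    evaluate: hypothetical callback taking a frozenset of feature
    indices and returning a score (higher is better).
    Stops after max_stale consecutive expansions without a new best.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie = count()  # tie-breaker so the heap never compares frozensets
    open_heap = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        # Expand the most promising subset seen so far.
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_heap, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Toy problem (made up): features 0 and 2 help, the others hurt.
USEFUL = {0, 2}


def toy_score(subset):
    return len(subset & USEFUL) - 0.5 * len(subset - USEFUL)


subset, score = best_first_forward(4, toy_score)
print(sorted(subset), score)
```

Unlike plain forward selection, the search keeps every evaluated subset in a priority queue, so it can back up to an earlier subset when the current path stops improving.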

Page 4: Spanish Inquisition

CFS Algorithm Visual Diagram [1]

Page 5: Spanish Inquisition

Error rate (%) of each algorithm before and after applying CFS/BestFirst filtering

Algorithm        Before* (%)   After** (%)   Error Rate Reduction (%)
J48              32.17         28.02         12.92
Bagging (J48)    18.26         16.38         10.30
Boosting (J48)   20.87         16.38         21.52
Random Forests   15.65         14.22          9.12
SMO (SVM)        15.22         14.22          6.53

* From Week 1, using all 1544 attributes

** After applying CFS/BestFirst filtering, 125 attributes
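The Error Rate Reduction column appears to be the relative reduction, (before - after) / before * 100. The sketch below recomputes it from the rounded rates in the table. The function name is made up for illustration, and the results agree with the slide only to within a few hundredths of a percentage point, which suggests the original column was computed from unrounded error rates.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction in percent."""
    return 100.0 * (before - after) / before


# (before, after) error rates in percent, copied from the table.
rates = {
    "J48": (32.17, 28.02),
    "Bagging (J48)": (18.26, 16.38),
    "Boosting (J48)": (20.87, 16.38),
    "Random Forests": (15.65, 14.22),
    "SMO (SVM)": (15.22, 14.22),
}

for name, (before, after) in rates.items():
    print(f"{name}: {relative_reduction(before, after):.2f}%")
```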

Page 6: Spanish Inquisition

ROC – Receiver Operating Characteristic

• ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2]

• “one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2]

• Therefore the area under the ROC curve (AUC) is a single number that can be used to compare classifiers; a higher AUC means the curve lies further to the northwest.
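As a rough illustration of how an AUC value is obtained, the sketch below sweeps a threshold over classifier scores to get ROC points and integrates them with the trapezoid rule. The function names and the six example scores are made up. Labels are 1 for the class of interest (for example one subtype) and 0 for everything else.

```python
def roc_points(labels, scores):
    """ROC points (fp rate, tp rate) from sweeping a score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so they yield a single point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: three positives, three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9; a classifier that ranks every positive above every negative reaches 1.0.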

Page 7: Spanish Inquisition

ROC Data – Area under Curve

Subtype              J48      Bagging (J48)   Boosting (J48)   Random Forests   SMO (SVM)
Basal-like           0.8978   0.9851          0.9883           0.9939           0.9802
Claudin-low          0.9515   0.9993          0.9975           0.9979           0.9977
HER2+/ER-            0.8137   0.9614          0.964            0.9476           0.9313
Luminal A            0.856    0.9558          0.9497           0.9735           0.9418
Luminal B            0.7842   0.93            0.9183           0.9563           0.9336
Normal Breast-like   0.7676   0.9731          0.922            0.9772           0.955

Page 8: Spanish Inquisition

Example ROC – Random Forests

Page 9: Spanish Inquisition

MeV Analysis

• Initial Hierarchical Clustering

Page 10: Spanish Inquisition

Analyze the Cluster

Page 11: Spanish Inquisition

FLJ13710 and GATA3

Lowly expressed in basal-like samples.

Highly expressed in luminal samples.

Page 12: Spanish Inquisition

GATA3

• GATA3 levels are a known indicator of breast cancer prognosis. (Basal-like has a worse prognosis than luminal.)

• Associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.

Page 13: Spanish Inquisition

FLJ13710

• Mentioned in a paper on finding prognostic signatures for breast cancer.

• Couldn’t find any in-depth studies on this gene.

Page 14: Spanish Inquisition

References

1) Mark Hall, “Correlation-based Feature Selection for Machine Learning”, http://www.cs.waikato.ac.nz/~mhall/thesis.pdf

2) Tom Fawcett, “An introduction to ROC analysis”, doi:10.1016/j.patrec.2005.10.010 (enter into http://dx.doi.org/)

3) Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. http://www.molecular-cancer.com/content/7/1/49

4) Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”, http://erc.endocrinology-journals.org/cgi/content/abstract/10/2/193

5) “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).” www.nature.com/onc/journal/v25/n15/extref/1209254x4.xls

6) Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.” http://www.enm.bris.ac.uk/cig/pubs/2005/rs4.pdf