7/10/07 - SEDE'07
DATA MINING APPLICATIONS
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841
Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email protected]
The 2000 ozone hole over the Antarctic, as seen by EPTOMS http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
OBJECTIVE
Explore some of the applications of data mining techniques.
Data Mining Applications Outline
Introduction – Data Mining Overview
• Classification (Prediction, Forecasting)
• Clustering
• Association Rules (Link Analysis)
Applications
• Fraud Detection & Illegal Activities
• Facial Recognition
• Cheating & Plagiarism
• Bioinformatics
Conclusions
Data Mining Overview
Finding hidden information in a database
Fit data to a model
You must know what you are looking for
You must know how to look for it
“If it looks like a duck,
walks like a duck, and
quacks like a duck, then
it’s a duck.”
Description: Classification (Profiling)
Behavior: Clustering (Similarity)
Associations: Link Analysis
“If it looks like a terrorist,
walks like a terrorist, and
quacks like a terrorist, then
it’s a terrorist.”
Classification Applications
Teachers classify students’ grades as A, B, C, D, or F.
Letter Recognition
Handwriting Recognition
Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254
Classification Example
Grasshoppers
Katydids
Given a collection of annotated data (in this case, five instances of katydids and five of grasshoppers), decide what type of insect the unlabeled example is.
(c) Eamonn Keogh, [email protected]
[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids]
(c) Eamonn Keogh, [email protected]
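One simple way to label the unknown insect is a nearest-neighbor rule: find the annotated examples closest to it in the (abdomen length, antenna length) plane and take a majority vote. The sketch below is only an illustration. The measurements are made-up values rather than the data plotted on the slide, the function name `classify` and the choice k=3 are assumptions, and the slides do not say which classifier was actually used.

```python
import math

# Hypothetical (abdomen_length, antenna_length) measurements.
# These are NOT the values plotted on the slide.
TRAINING = [
    ((2.0, 1.5), "Grasshopper"),
    ((3.0, 2.0), "Grasshopper"),
    ((4.5, 2.5), "Grasshopper"),
    ((5.5, 3.0), "Grasshopper"),
    ((7.0, 3.5), "Grasshopper"),
    ((2.5, 5.5), "Katydid"),
    ((3.5, 6.5), "Katydid"),
    ((5.0, 7.5), "Katydid"),
    ((6.0, 8.5), "Katydid"),
    ((7.5, 9.0), "Katydid"),
]


def classify(sample, training=TRAINING, k=3):
    """Label `sample` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


if __name__ == "__main__":
    print(classify((5.0, 2.8)))  # short antennae relative to abdomen
    print(classify((4.0, 7.0)))  # long antennae relative to abdomen
```

With these invented points, an insect with short antennae lands among the grasshoppers and one with long antennae among the katydids.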
Clustering Applications
Targeted Marketing
Determining Gene Functionality
Identifying Species
Clustering vs. Classification
• No prior knowledge
• Number of clusters
• Meaning of clusters
Unsupervised learning
What is Similarity?
(c) Eamonn Keogh, [email protected]
Association Rules Applications
People who buy diapers also buy beer
If gene A is highly expressed in this disease then gene B is also expressed
Relationships between people
www.amazon.com
Book Stores
Department Stores
Advertising
Product Placement
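A rule such as "people who buy diapers also buy beer" is usually judged by its support (how often the items occur together) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both. The transactions are invented for illustration, and the function names `support` and `confidence` are assumptions rather than anything taken from the slides.

```python
# Hypothetical market-basket transactions, for illustration only.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


if __name__ == "__main__":
    print(support({"diapers", "beer"}))       # share of baskets with both items
    print(confidence({"diapers"}, {"beer"}))  # share of diaper baskets that also hold beer
```

In this toy data two of the five baskets contain both items, and two of the three diaper baskets also contain beer.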
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Fraud Detection
Identify fraudulent behavior
Used extensively in financial, law enforcement, health care, etc. sectors
http://www.aaai.org/AITopics/html/fraud.html
SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
Neural Technologies: http://www.neuralt.com/fraud_management.html
Law Enforcement
Identify suspect behavior and relationships
I2 Inc.
• Investigative analytic/visualization software
• http://www.i2inc.com
Social Network Analysis – Analyze patterns of relationships
Relationships: personal, religious, operational, etc.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
Facial Recognition
Based upon features in face
Convert face to a feature vector
Less invasive than other biometric techniques
http://www.face-rec.org
http://computer.howstuffworks.com/facial-recognition.htm
SIMS: http://www.casinoincidentreporting.com/Products.aspx
(c) Eamonn Keogh, [email protected]
Cheating on Multiple Choice Tests
Similarity between tests based on number of common wrong answers.
(George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923.)
The number of common correct answers is often ignored.
H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351):
H-H = (Number of exact answers in common) / (Number of different answers)
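The two similarity measures on this slide can be sketched in a few lines. The code below rests on two assumptions: that the H-H index is the ratio of the first quantity to the second, and that "exact answers in common" counts questions where both students gave the same wrong answer. The answer strings and the function names are hypothetical.

```python
def common_wrong(answers_a, answers_b, key):
    """Count questions where both students chose the same wrong answer."""
    return sum(a == b != k for a, b, k in zip(answers_a, answers_b, key))


def different(answers_a, answers_b):
    """Count questions where the two students answered differently."""
    return sum(a != b for a, b in zip(answers_a, answers_b))


def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers (assumed reading)."""
    diff = different(answers_a, answers_b)
    if diff == 0:
        return float("inf")  # identical answer sheets: the ratio is unbounded
    return common_wrong(answers_a, answers_b, key) / diff


if __name__ == "__main__":
    key = "ABCDABCDAB"        # hypothetical answer key
    student_1 = "ABCDBBCAAC"  # hypothetical answer sheets
    student_2 = "ABCDBBCAAB"
    print(common_wrong(student_1, student_2, key))
    print(hh_index(student_1, student_2, key))
```

A high ratio means the two sheets share many identical mistakes while differing on few questions, which is the pattern these measures are meant to flag.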
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating
Rampant Cheating
DNA
Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
www.bioalgorithms.info; chapter 6; Gene Prediction
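The DNA -> RNA -> Protein example on this slide can be reproduced with a short sketch. It assumes the DNA shown is the coding strand, so transcription is written as replacing T with U, and it uses the standard genetic code with one-letter amino acid symbols. The function names `transcribe` and `translate` are illustrative, not taken from the slides.

```python
from itertools import product

# Standard genetic code, listed with each codon position cycling through U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Write a DNA coding-strand sequence as RNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA codon by codon, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)
    print(translate(rna))
```

Reading the slide's RNA three bases at a time gives P, E, P, T, I, D, E, which is why the protein on the slide is spelled PEPTIDE.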
miRNA
Short (20-25nt) sequence of noncoding RNA
Known since 1993 but significance not widely appreciated until 2001
Impact / Prevent translation of mRNA
Generally reduce protein levels without impacting mRNA levels (animal cells)
Functions
• Causes some cancers
• Guide embryo development
• Regulate cell differentiation
• Associated with HIV
• …
Questions
If each cell in an organism contains the same DNA –
• How does each cell behave differently?
• Why do cells behave differently during childhood?
• What causes some cells to act differently – such as during disease?
DNA contains many genes, but only a few are being transcribed – why?
One answer - miRNA
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
Human Genome
Scientists originally thought there would be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps. What makes the difference?
Visualization from UCRdnaQT.mov
Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
Computer Science & Bioinformatics
Algorithms
Data Structures
Improving efficiency
Data Mining
Biologists don’t usually understand or even appreciate what Computer Science can do
Issues:
• Scalability
• Fuzzy
We will look at:
• Microarray Clustering
• TCGR
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis
Each probe location associated with gene
Measure the amount of mRNA
Color indicates degree of gene expression
Compare different samples (normal/disease)
Track same sample over time
Questions
• Which genes are related to this disease?
• Which genes behave in a similar manner?
• What is the function of a gene?
Clustering
• Hierarchical
• K-means
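To make the K-means option concrete, the sketch below groups genes whose expression profiles behave in a similar manner. Everything in it is an assumption made for illustration: the gene names and expression values are invented, the initial centroids are chosen by hand, and a real microarray analysis would involve thousands of genes plus normalization steps not shown here.

```python
import math

# Hypothetical expression levels for four genes across three samples.
PROFILES = {
    "geneA": (0.9, 1.0, 1.1),
    "geneB": (1.0, 1.2, 0.9),
    "geneC": (5.0, 5.2, 4.9),
    "geneD": (5.1, 4.8, 5.0),
}


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(points, centroids), centroids


if __name__ == "__main__":
    points = list(PROFILES.values())
    labels, _ = kmeans(points, [PROFILES["geneA"], PROFILES["geneC"]])
    for gene, label in zip(PROFILES, labels):
        print(gene, "-> cluster", label)
```

Unlike the classification example, no labels are supplied: the number of clusters is chosen up front and the meaning of each cluster has to be interpreted afterwards.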
Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer"
Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
miRNA Research Issues
Predict / Find miRNA in genomic sequence
Predict miRNA targets
Identify miRNA functions
Temporal CGR (TCGR) 2D Array
Each Row represents counts for a particular window in sequence
• First row – first window
• Last row – last window
• We start successive windows at the next character location
Each Column represents the counts for the associated pattern in that window
• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern length
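The row and column layout described above can be sketched as a sliding-window count. This is one reading of the slide, not the authors' implementation: it assumes fixed-length windows that advance one character at a time, overlapping occurrences of each sub-pattern counted within a window, only full windows kept, and an RNA alphabet. The function name `tcgr` and its parameters are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Build a 2D count array: one row per window, one column per pattern.

    Returns (patterns, rows), where rows[i][j] is the number of times
    patterns[j] occurs in the window starting at position i.
    Assumes every character of `sequence` is in `alphabet`.
    """
    # Columns are all patterns of the given length, in alphabetical order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character location.
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        counts = [0] * len(patterns)
        for k in range(window - pattern_len + 1):
            counts[column[win[k:k + pattern_len]]] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, rows = tcgr("ACGUA", window=3, pattern_len=1)
    print(patterns)
    for row in rows:
        print(row)
```

Under these assumptions the array has (sequence length - window + 1) rows and (alphabet size ** pattern length) columns, which matches the remark that the size depends on sequence length and subpattern length.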
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
TCGR – Mature miRNA (Window=5; Pattern=3)
All Mature
Mus Musculus
Homo Sapiens
C Elegans
ACG CGC GCG UCG
POSITIVE
NEGATIVE
TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Conclusions
Not magic
Doesn’t work for all applications
• Stock Market Prediction
Issues
• Privacy
• Data
Here are some infamous examples of failed data mining applications
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
BIG BROTHER ?
Total Information Awareness
• http://infowar.net/tia/www.darpa.mil/iao/index.htm
• http://www.govtech.net/magazine/story.php?id=45918
• http://en.wikipedia.org/wiki/Information_Awareness_Office
Terror Watch List
• http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
• http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
• http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
• http://www.thedenverchannel.com/news/9559707/detail.html
CAPPS
• http://www.theregister.co.uk/2004/04/26/airport_security_failures/
• http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
• http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
• http://en.wikipedia.org/wiki/CAPPS