WPI Center for Research in Exploratory Data and Information Analysis CREDIA
KDDRG Research Projects
Prof. Carolina [email protected]
Department of Computer Science
Worcester Polytechnic Institute
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
– Systems performance Data
– Sleep Data
– Financial Data
– Web Data
• Data Mining for Genetic Analysis
– Correlating genetic information with diseases
– Predicting gene expression patterns
• Data Mining for Electronic Commerce
– Collaborative and Content-Based Filtering
• Using Association Rules and using Neural Networks
Analyzing Sleep Data
WPI, UMass Medical, BC
(Source: http://www.blsc.com)
DATA SET
• Clinical (sequential)
– Electro-encephalogram (EEG), electro-oculogram (EOG), electro-myogram (EMG), probe measuring flow of oxygen in blood, etc.
• Diagnostic (tabular)
– Questionnaire responses
– Patient’s demographic info.
– Patient’s medical history
Purpose: Associations between sleep patterns and health/pathology
Obtain patterns of different sleep stages (4 sleep + REM + Wake)
Potential Rules:
(A) Association Rules
(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy confidence=92%, support=13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian) confidence=70%, support=8%
*AHI = Apnea–Hypopnea Index, **OSA = Obstructive Sleep Apnea
Input Data
• Each instance: [Tabular | set | sequential] * attributes

      attr1                        attr2            attr3  attr4               attr5   [class]
      illnesses                    heart rate       age    oxygen              gender  Epworth
P1    {depression, fatigue}                         27                         M       5
P2    {stroke, dementia, fatigue}  97,72,67,80,…    73     90,92,96,89,86,…    F       23
P3    {arthritis}                  102,99,87,96,…   49     97,100,82,80,70,…   M       14
…     …                            …                …      …                   …       …
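One possible in-memory shape for such mixed instances is sketched below. The class name `PatientInstance` and its field names are assumptions made for illustration only; the slide gives just the attribute labels. The sequences hold only the leading values shown for P2 and P3, since the slide elides the rest with "…".

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PatientInstance:
    """Hypothetical record mixing set, sequential, and tabular attributes."""
    illnesses: FrozenSet[str]     # attr1: set-valued
    heart_rate: Tuple[int, ...]   # attr2: sequential
    age: int                      # attr3: tabular
    oxygen: Tuple[int, ...]       # attr4: sequential
    gender: str                   # attr5: tabular
    epworth: int                  # class attribute


# Only the values visible on the slide; the sequences continue in the real data.
p2 = PatientInstance(frozenset({"stroke", "dementia", "fatigue"}),
                     (97, 72, 67, 80), 73, (90, 92, 96, 89, 86), "F", 23)
p3 = PatientInstance(frozenset({"arthritis"}),
                     (102, 99, 87, 96), 49, (97, 100, 82, 80, 70), "M", 14)

# Set attributes support membership tests; sequences keep their order.
print("fatigue" in p2.illnesses, p3.heart_rate[0])
```

Keeping the three attribute kinds as distinct types lets a miner treat each one with the matching operation: subset tests for sets, ordered comparison for sequences, and plain equality or ranges for tabular values.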
Analyzing Financial Data
• Sequential data – daily stock values
• “Normal” (tabular/relational) data
– sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
– If DELL’s stock value increases & 1999<year<2002 => IBM’s stock value decreases
Events – Financial Data
Basic events: 16 or so financial templates [Little&Rhodes78]
difficult pattern matching – alignments and time warping
[Figure panels: Rounding Top Reversal; Descending Triangle Reversal; Panic Reversal; Head & Shoulders Reversal]
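The slide names "alignments and time warping" without saying which method is used. As one illustration, the sketch below implements classic dynamic time warping (DTW), a standard way to compare a price series against a template when the two are stretched differently in time. The function name `dtw_distance` and the sample values are hypothetical and do not come from the project.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j],
    using absolute difference as the local cost.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of: advance in a only, in b only, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical template and a time-stretched observation of the same shape.
template = [1, 2, 3, 2, 1]
observed = [1, 1, 2, 3, 3, 2, 1]
print(dtw_distance(template, observed))
```

Because warping lets one point match several consecutive points, the stretched series aligns with the template at zero cost, whereas a point-by-point comparison would not even be defined for sequences of different lengths.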
WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis
– discovering correlations between sequence variations and diseases
• Gene expression
– discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources

Patient Gender   SMA Type (Severity)   SNP Location   C212 (Father / Mother)   AG1-CA (Father / Mother)
Female           Severe                Y272C          31 / 28 29               102 / 108 112
Male             Mild                  Y272C          28 29 / 25               108 112 / 114

Wirth, B. et al. Journal of Human Molecular Genetics
Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: CAGE diagram. Labels: Muscle Cell, Neural Cell, Seam Cells; Gene 1, Gene 2, Gene 3; On / Off]
Gene Expression Analysis

PROMOTER(S)                                        GENE     CELL TYPES
PR1  ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC   Gene 1   neural
PR2  ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA    Gene 2   neural
PR3  TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA    Gene 3   muscle
PR4  GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC    Gene 4   neural
PR5  ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA    Gene 5   muscle
PR6  CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA    Gene 6   neural
PR7  GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA    Gene 7   neural
PR8  AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA    Gene 8   neural
PR9  CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT    Gene 9   muscle

[Figure overlay: motif occurrences marked as boxes on the promoter sequences: M1 ×6, M2 ×4, M3 ×3, M4 ×7, M5 ×6]
Gene Expression
• Transcription of DNA into RNA
[Figure: PROMOTER REGION ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA of a GENE in a MUSCLE CELL, with TRANSCRIPTIONAL PROTEINS TF 1, TF 2, TF 3 at MOTIFS M1, M4, M2; distance labels 240 and 100]
R1: M1, M4, M5 => Neural (supp = 22%, conf = 100%) [Supp. instances: PR1, PR2]
R2: M2, M4, M5 => Neural (supp = 22%, conf = 100%) [Supp. instances: PR1, PR8]
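The support and confidence figures can be checked with a few lines of code. In the sketch below, the cell types for PR1 to PR9 are the ones listed on the slide, but the motif sets are made up: they were chosen only to agree with the supporting instances named for R1 and R2, because the slide's actual motif layout is a figure. The function name `support_confidence` is likewise an assumption.

```python
def support_confidence(instances, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    instances maps a name to (set of motifs, class label).
    """
    antecedent = frozenset(antecedent)
    # Labels of all instances whose motif set contains the whole antecedent.
    matched = [label for motifs, label in instances.values() if antecedent <= motifs]
    hits = sum(1 for label in matched if label == consequent)
    support = hits / len(instances)
    confidence = hits / len(matched) if matched else 0.0
    return support, confidence


# Hypothetical motif sets; only the cell types are taken from the slide.
promoters = {
    "PR1": (frozenset({"M1", "M2", "M4", "M5"}), "neural"),
    "PR2": (frozenset({"M1", "M4", "M5"}), "neural"),
    "PR3": (frozenset({"M1", "M3"}), "muscle"),
    "PR4": (frozenset({"M2", "M4"}), "neural"),
    "PR5": (frozenset({"M3", "M4"}), "muscle"),
    "PR6": (frozenset({"M1", "M5"}), "neural"),
    "PR7": (frozenset({"M4", "M5"}), "neural"),
    "PR8": (frozenset({"M2", "M4", "M5"}), "neural"),
    "PR9": (frozenset({"M3", "M5"}), "muscle"),
}

r1 = support_confidence(promoters, {"M1", "M4", "M5"}, "neural")
r2 = support_confidence(promoters, {"M2", "M4", "M5"}, "neural")
print(r1, r2)
```

With nine promoters, two supporting instances give a support of 2/9, which rounds to the 22% quoted for both rules; confidence is 100% because every promoter containing the antecedent is neural.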
“Well-clustered” motifs
[Figure: promoters with motif occurrences M1 to M5 and the distances between motifs labeled: 240, 190, 150, 120, 210, 100, 21, 18, 60, 150, 100, 360, 100, 350, 260, 210, 110]
Coefficient of variation of distances (cvd) between two motifs:
$$\mathrm{cvd}_{IR_n}(M_j, M_k) = \frac{\sigma_{IR_n}(M_j, M_k)}{\mu_{IR_n}(M_j, M_k)}$$
(σ: standard deviation, μ: mean of the distances between the two motifs)
IR1 = {M1, M2, M5}: σ(M1,M2) = 120.1, μ(M1,M2) = 216.6, cvd(M1,M2) = 0.55
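A coefficient of variation is a standard deviation divided by a mean, so the cvd can be computed directly from the list of distances observed between two motifs. The sketch below assumes the population standard deviation (`statistics.pstdev`); the slide does not say whether the population or sample form is used. The distance list is hypothetical, and the last lines only re-derive the slide's own quotient.

```python
from statistics import mean, pstdev


def cvd(distances):
    """Coefficient of variation of distances: standard deviation / mean.

    Assumes the population standard deviation; swap in statistics.stdev
    if the sample form is intended.
    """
    return pstdev(distances) / mean(distances)


# Hypothetical distances between one motif pair across three promoters.
example = [100, 200, 300]
print(round(cvd(example), 3))

# Re-deriving the slide's value for (M1, M2) from its sdev and mean.
slide_cvd = 120.1 / 216.6
print(round(slide_cvd, 3))
```

A small cvd means the two motifs sit at a nearly constant distance from each other across the supporting promoters, which is what "well-clustered" refers to.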
Distance-based Association Rules
• Given:
– min-support
– min-confidence
– max-cvd
thresholds
• Mine:
– all distance-based association rules

Sample distance-based assoc. rule
R1: M1, M2, M5 => Neural (sup=33%, conf=100%)

            M2      M5
M1   cvd    0.554   0.076
     mean   216.6   462.0
     sdev   120.1   35.0
M2   cvd            0.433
     mean           237.0
     sdev           103.0
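The three thresholds can be read as a filter over candidate rules. The sketch below checks the sample rule R1 against them, using the support, confidence, and pairwise cvd values from the slide. The class `DistanceRule`, the function `passes_thresholds`, the threshold values, and the choice to require every motif pair to stay under max-cvd are all assumptions; the slide does not state how max-cvd is applied across pairs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DistanceRule:
    """Hypothetical container for a distance-based association rule."""
    antecedent: Tuple[str, ...]
    consequent: str
    support: float
    confidence: float
    pair_cvd: Dict[Tuple[str, str], float]  # cvd per motif pair


def passes_thresholds(rule, min_support, min_confidence, max_cvd):
    """True if the rule meets all three thresholds.

    Assumption: every motif pair must have cvd <= max_cvd.
    """
    return (rule.support >= min_support
            and rule.confidence >= min_confidence
            and all(v <= max_cvd for v in rule.pair_cvd.values()))


# Values for R1 as given on the slide.
r1 = DistanceRule(("M1", "M2", "M5"), "Neural", 0.33, 1.00,
                  {("M1", "M2"): 0.554, ("M1", "M5"): 0.076, ("M2", "M5"): 0.433})

# Threshold values below are hypothetical.
print(passes_thresholds(r1, min_support=0.2, min_confidence=0.9, max_cvd=0.6))
print(passes_thresholds(r1, min_support=0.2, min_confidence=0.9, max_cvd=0.5))
```

Under this reading, tightening max-cvd from 0.6 to 0.5 rejects R1 because the (M1, M2) pair has a cvd of 0.554, even though support and confidence are unchanged.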
Grad. & Undergrad. Students
• Ali Benamara
• Dharmesh Thakkar
• Senthil K Palanisamy
• Zachary Stoecker-Sylvia
• Keith A. Pray
• Jonathan Freyberger
• Maged El-Sayed
• Parameshvyas Laxminarayan
• Aleksandar Icev
• Wendy Kogel
• Michael Sao Pedro
• Christopher Shoemaker
• Weiyang Lin
• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov
• Takeshi Kawato
• Cindy Leung and Sam Holmes
• John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young
• Zachary Stoecker-Sylvia
• Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB)
• Wendy Kogel, Brooke LeClair, Christopher St. Yves
• Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB)
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB)
• Christopher Cole
• Michael Ciman and John Gulbrandsen
• Tara Halwes
• Christopher Martino
• Matthew Berube
• Anna Novikov
• Amy Kao and Dana Rock