WPI Center for Research in Exploratory Data and Information Analysis (CREDIA): KDDRG Research Projects. Prof. Carolina Ruiz, ruiz@cs.wpi.edu, Department of Computer Science, Worcester Polytechnic Institute

Upload: audrey-ledsome

Post on 14-Dec-2015

WPI Center for Research in Exploratory Data and Information Analysis CREDIA

KDDRG Research Projects

Prof. Carolina Ruiz

ruiz@cs.wpi.edu

Department of Computer Science

Worcester Polytechnic Institute

Some Current Analytical Data Mining Research Projects at WPI

• Mining Complex Data: Set and Sequence Mining
  – Systems performance Data
  – Sleep Data
  – Financial Data
  – Web Data

• Data Mining for Genetic Analysis
  – Correlating genetic information with diseases
  – Predicting gene expression patterns

• Data Mining for Electronic Commerce
  – Collaborative and Content-Based Filtering
    • Using Association Rules and using Neural Networks

Analyzing Sleep Data

WPI, UMass Medical, BC

DATA SET

• Clinical (sequential)
  – Electro-encephalogram (EEG)
  – Electro-oculogram (EOG)
  – Electro-myogram (EMG)
  – Probe measuring flow of oxygen in blood, etc.

• Diagnostic (tabular)
  – Questionnaire responses
  – Patient’s demographic info.
  – Patient’s medical history

(Source: http://www. blsc.com)

Purpose: Associations between sleep patterns and health/pathology

Obtain patterns of different sleep stages (4 sleep + REM + Wake)

Potential Rules:

(A) Association Rules

(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy; confidence = 92%, support = 13%

(B) Classification Rules

(snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian); confidence = 70%, support = 8%

*AHI = Apnea–Hypopnea Index, **OSA = Obstructive Sleep Apnea
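
As a minimal sketch of how the confidence and support figures attached to such rules are computed, the Python below evaluates a rule of the form antecedent => consequent over a small table of patient records. The records, the field names, and the helper `rule_stats` are hypothetical and chosen only for illustration; they are not the study's data or tooling.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def rule_stats(records: List[Record],
               antecedent: Callable[[Record], bool],
               consequent: Callable[[Record], bool]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all records matching antecedent AND consequent
    confidence = fraction of antecedent-matching records that also match
                 the consequent
    """
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical patient records (illustrative values only).
patients = [
    {"sleep_latency_min": 2.0, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 1.5, "hereditary_disorder": True, "diagnosis": "narcolepsy"},
    {"sleep_latency_min": 2.5, "hereditary_disorder": True, "diagnosis": "none"},
    {"sleep_latency_min": 12.0, "hereditary_disorder": False, "diagnosis": "none"},
    {"sleep_latency_min": 8.0, "hereditary_disorder": True, "diagnosis": "OSA"},
]

support, confidence = rule_stats(
    patients,
    antecedent=lambda r: r["sleep_latency_min"] < 3 and r["hereditary_disorder"],
    consequent=lambda r: r["diagnosis"] == "narcolepsy",
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Expressing the antecedent and consequent as predicates keeps the same function usable for both rule types listed on the slide, since only the conditions change.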

Input Data

• Each instance: [Tabular | set | sequential] * attributes

   | attr1                       | attr2          | attr3 | attr4             | attr5  | [class]
   | illnesses                   | heart rate     | age   | oxygen            | gender | Epworth
P1 | {depression, fatigue}       |                | 27    |                   | M      | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,…  | 73    | 90,92,96,89,86,…  | F      | 23
P3 | {arthritis}                 | 102,99,87,96,… | 49    | 97,100,82,80,70,… | M      | 14
…  | …                           | …              | …     | …                 | …      | …
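
One way to hold such mixed instances in code is sketched below, assuming a Python representation in which set-valued attributes are frozensets, sequential attributes are tuples, and everything else is a plain tabular value. The class name `PatientInstance` and the helper `attribute_kind` are hypothetical, and the sequences contain only the leading values visible on the slide (the rest are elided there).

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PatientInstance:
    illnesses: FrozenSet[str]      # set-valued attribute
    heart_rate: Tuple[int, ...]    # sequential attribute
    age: int                       # tabular attribute
    oxygen: Tuple[int, ...]        # sequential attribute
    gender: str                    # tabular attribute
    epworth: int                   # class attribute


def attribute_kind(value: object) -> str:
    """Classify a value as 'set', 'sequential', or 'tabular'."""
    if isinstance(value, frozenset):
        return "set"
    if isinstance(value, tuple):
        return "sequential"
    return "tabular"


# P2 and P3 from the slide; sequences are truncated to the values shown.
p2 = PatientInstance(frozenset({"stroke", "dementia", "fatigue"}),
                     (97, 72, 67, 80), 73, (90, 92, 96, 89, 86), "F", 23)
p3 = PatientInstance(frozenset({"arthritis"}),
                     (102, 99, 87, 96), 49, (97, 100, 82, 80, 70), "M", 14)

kinds = {f.name: attribute_kind(getattr(p2, f.name)) for f in fields(p2)}
print(kinds)
```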

Analyzing Financial Data

• Sequential data – daily stock values

• “Normal” (tabular/relational) data
  – sector (computers, agricultural, educational, …), type of government, product releases, company awards, …

• Desired rules:
  – If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases

Events – Financial Data

Basic events: 16 or so financial templates [Little&Rhodes78]

Difficult pattern matching: alignments and time warping

Example templates (figure): Rounding Top Reversal, Descending Triangle Reversal, Panic Reversal, Head & Shoulders Reversal
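
To illustrate why alignment and time warping matter when matching a price series against a template, here is a minimal sketch of classic dynamic time warping (DTW) in Python. The slides do not say which matching algorithm the project uses, so this is only one standard technique; the function name `dtw_distance` and the three series are hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses absolute difference as the local cost and allows each point of
    one series to align with one or more points of the other.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]


# Hypothetical data: a peak-shaped template, the same shape stretched in
# time, and a steadily falling series.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
falling = [3.0, 2.5, 2.0, 1.5, 1.0]

print(dtw_distance(template, stretched))
print(dtw_distance(template, falling))
```

The stretched series matches the template exactly under warping even though it is twice as long, which a point-by-point comparison could not express.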

WPI Weka

Tool for mining complex temporal/spatial associations

Data Mining for Genetic Analysis

w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)

• SNP analysis
  – discovering correlations between sequence variations and diseases

• Gene expression
  – discovering patterns that cause a gene to be expressed in a particular cell

Correlating Genetics with Diseases

• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research

• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.

Genomic Data Resources

Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114

Wirth, B. et al. Journal of Human Molecular Genetics

Our System: CAGE

To predict gene expression based on DNA sequences.

[Figure: CAGE shown with three cell types (Muscle Cell, Neural Cell, Seam Cells), each with Gene 1, Gene 2, and Gene 3 marked On or Off.]

Gene Expression Analysis

    | PROMOTER(S)                                 |        | CELL TYPES
PR1 | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC | Gene 1 | neural
PR2 | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA  | Gene 2 | neural
PR3 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA  | Gene 3 | muscle
PR4 | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC  | Gene 4 | neural
PR5 | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA  | Gene 5 | muscle
PR6 | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA  | Gene 6 | neural
PR7 | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA  | Gene 7 | neural
PR8 | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA  | Gene 8 | neural
PR9 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT  | Gene 9 | muscle

[Figure annotations: occurrences of motifs M1, M2, M3, M4, and M5 marked on the promoter sequences.]

Gene Expression

• Transcription of DNA into RNA

[Figure: a MUSCLE CELL in which TRANSCRIPTIONAL PROTEINS (TF 1, TF 2, TF 3) are drawn over MOTIFS M1, M4, M2 in the PROMOTER REGION (..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA) ahead of the GENE; the values 240 and 100 label the figure.]

R1: M1, M4, M5 => Neural (supp = 22%, conf = 100%) [Supp. instances: PR1, PR2]

R2: M2, M4, M5 => Neural (supp = 22%, conf = 100%) [Supp. instances: PR1, PR8]
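
The sketch below shows, in Python, how rules of this shape could be mined by brute-force enumeration of motif sets. The motif-to-promoter assignment in `PROMOTERS` is hypothetical: the transcript does not preserve which motifs occur in which promoter, so the sets were chosen only to agree with the cell types and the supporting instances quoted for R1 and R2. The function names are likewise assumptions, and a real miner would normally use Apriori-style pruning rather than full enumeration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical motif sets per promoter (cell types are from the slide).
PROMOTERS: Dict[str, Tuple[FrozenSet[str], str]] = {
    "PR1": (frozenset({"M1", "M2", "M4", "M5"}), "neural"),
    "PR2": (frozenset({"M1", "M4", "M5"}), "neural"),
    "PR3": (frozenset({"M1", "M3", "M4"}), "muscle"),
    "PR4": (frozenset({"M1", "M2", "M5"}), "neural"),
    "PR5": (frozenset({"M3", "M4"}), "muscle"),
    "PR6": (frozenset({"M1", "M2", "M5"}), "neural"),
    "PR7": (frozenset({"M4", "M5"}), "neural"),
    "PR8": (frozenset({"M2", "M4", "M5"}), "neural"),
    "PR9": (frozenset({"M1", "M3", "M4"}), "muscle"),
}


def evaluate(antecedent: FrozenSet[str], cell_type: str):
    """Return (support, confidence, supporting promoters) for a rule."""
    covered = [pr for pr, (motifs, _) in PROMOTERS.items()
               if antecedent <= motifs]
    supporting = [pr for pr in covered if PROMOTERS[pr][1] == cell_type]
    support = len(supporting) / len(PROMOTERS)
    confidence = len(supporting) / len(covered) if covered else 0.0
    return support, confidence, supporting


def mine_rules(size: int, cell_type: str,
               min_support: float, min_confidence: float) -> List[tuple]:
    """Enumerate every motif set of the given size and keep the rules
    that meet both thresholds."""
    all_motifs = sorted(set().union(*(m for m, _ in PROMOTERS.values())))
    rules = []
    for combo in combinations(all_motifs, size):
        s, c, inst = evaluate(frozenset(combo), cell_type)
        if s >= min_support and c >= min_confidence:
            rules.append((combo, s, c, inst))
    return rules


for combo, s, c, inst in mine_rules(3, "neural", 0.2, 1.0):
    print(f"{', '.join(combo)} => Neural  supp={s:.0%} conf={c:.0%} {inst}")
```

With nine promoters, a rule supported by two of them has support 2/9, which rounds to the 22% quoted for R1 and R2.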

“Well-clustered” motifs

[Figure: occurrences of motifs M1 to M5 along the promoters, annotated with distances between motifs: 240, 190, 150, 120, 210, 100, 21, 18, 60, 150, 100, 360, 100, 350, 260, 210, 110.]

Coefficient of variation of distances (cvd) between two motifs:

$$\mathrm{cvd}_{IR_n}(M_j, M_k) = \frac{\sigma_{IR_n}(M_j, M_k)}{\mu_{IR_n}(M_j, M_k)}$$

Example, for IR1 = {M1, M2, M5}: σ(M1,M2) = 120.1, μ(M1,M2) = 216.6, cvd(M1,M2) = 0.55
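
A minimal Python sketch of the cvd computation follows. It assumes the population standard deviation (`statistics.pstdev`); the slide does not say whether the population or the sample form is intended. The function name `cvd` and the distance list are hypothetical, while 120.1 and 216.6 are the values quoted for (M1, M2), read here as standard deviation and mean.

```python
from statistics import mean, pstdev
from typing import Sequence


def cvd(distances: Sequence[float]) -> float:
    """Coefficient of variation of distances: standard deviation / mean.

    Assumption: population standard deviation; swap in statistics.stdev
    for the sample form.
    """
    mu = mean(distances)
    if mu == 0:
        raise ValueError("cvd is undefined when the mean distance is zero")
    return pstdev(distances) / mu


# Hypothetical distances between two motifs across three promoters.
example = [100, 110, 90]
print(round(cvd(example), 4))

# Ratio of the two values quoted on the slide for (M1, M2).
slide_ratio = 120.1 / 216.6
print(round(slide_ratio, 2))
```

A low cvd means the two motifs sit at a nearly constant distance from each other across the supporting promoters, which is the sense of "well-clustered" used here.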

Distance-based Association Rules

• Given thresholds:
  – min-support
  – min-confidence
  – max-cvd

• Mine:
  – all distance-based association rules

Sample distance-based assoc. rule

R1: M1, M2, M5 => Neural (sup=33%, conf=100%)

   | M2                                | M5
M1 | mean 216.6, sdev 120.1, cvd 0.554 | mean 462.0, sdev 35.0, cvd 0.076
M2 |                                   | mean 237.0, sdev 103.0, cvd 0.433
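
The three thresholds can be combined into a single acceptance test, sketched below in Python with the mean and sdev values of the sample rule. Two things are assumptions rather than statements from the slides: that a rule is kept only when every motif pair in its antecedent has cvd at or below max-cvd, and the particular threshold values used in the calls. The names `PAIR_STATS`, `well_clustered`, and `accept_rule` are hypothetical.

```python
from typing import Dict, Tuple

# (mean, sdev) of the distance between each motif pair in rule R1,
# as given in the sample rule's statistics.
PAIR_STATS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("M1", "M2"): (216.6, 120.1),
    ("M1", "M5"): (462.0, 35.0),
    ("M2", "M5"): (237.0, 103.0),
}


def well_clustered(pair_stats: Dict[Tuple[str, str], Tuple[float, float]],
                   max_cvd: float) -> bool:
    """Assumed criterion: every pair's cvd (sdev / mean) is <= max_cvd."""
    return all(sdev / mu <= max_cvd for mu, sdev in pair_stats.values())


def accept_rule(support: float, confidence: float,
                pair_stats: Dict[Tuple[str, str], Tuple[float, float]],
                min_support: float, min_confidence: float,
                max_cvd: float) -> bool:
    """Keep a rule only if it meets all three thresholds."""
    return (support >= min_support
            and confidence >= min_confidence
            and well_clustered(pair_stats, max_cvd))


# R1 has sup = 33% and conf = 100%; the thresholds are illustrative.
loose = accept_rule(0.33, 1.0, PAIR_STATS, 0.3, 0.9, max_cvd=0.6)
strict = accept_rule(0.33, 1.0, PAIR_STATS, 0.3, 0.9, max_cvd=0.5)
print(loose, strict)
```

Under the stricter threshold the rule is rejected because the (M1, M2) distance varies too much (cvd 0.554), even though its support and confidence are unchanged.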

Grad. & Undergrad. Students

• Ali Benamara
• Dharmesh Thakkar
• Senthil K Palanisamy
• Zachary Stoecker-Sylvia
• Keith A. Pray
• Jonathan Freyberger
• Maged El-Sayed
• Parameshvyas Laxminarayan
• Aleksandar Icev
• Wendy Kogel
• Michael Sao Pedro
• Christopher Shoemaker
• Weiyang Lin

• Jonathan Rudolph
• Eduardo Paredes
• Iavor N. Trifonov
• Takeshi Kawato
• Cindy Leung and Sam Holmes
• John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young
• Zachary Stoecker-Sylvia
• Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB)
• Wendy Kogel, Brooke LeClair, Christopher St. Yves
• Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB)
• Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB)
• Christopher Cole
• Michael Ciman and John Gulbrandsen
• Tara Halwes
• Christopher Martino
• Matthew Berube
• Anna Novikov
• Amy Kao and Dana Rock