![Page 1: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/1.jpg)
Towards New Models and Languages for Data Mining and Integration
Peter Brezany
Institute of Scientific Computing, University of Vienna, Austria
Presentation at the NeSC, Edinburgh, August 13, 2008
![Page 2: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/2.jpg)
Edinburgh, 13 Aug, 2008 2
Outline
Introduction
CRISP-DM Model and Methodology
  What is CRISP-DM
  Why update it
  From CRISP-DM to CRISP-DMI
Impact of CRISP-DMI on the DMI Workflow Language
State of the Art in Language Design
Discussion of the 1st Language Design Ideas
Conclusions and Future Work
![Page 3: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/3.jpg)
What is CRISP-DM?
Phases of the CRoss Industry Standard Process for Data Mining
![Page 4: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/4.jpg)
CRISP-DM Phases
Business Understanding: the process of understanding the project objectives from a business perspective
Data Understanding: the process of collecting and becoming familiar with data
Data Preparation: the process of selecting and cleansing the data that will be fed into the modeling tools
Modeling: the process of applying modeling techniques to the data so that conclusions can be drawn
Evaluation: the process of evaluating the model and its conclusions
Deployment: the process of applying the conclusions to a business
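The phase list above can be written down as a small data structure. This is a minimal Python sketch, not part of the presentation: the names `Phase` and `next_phase` are invented for illustration, and the wrap-around from Deployment back to Business Understanding is an assumption about the cyclic nature of the process. Real projects also move backwards between phases, which this sketch does not model.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in the order listed on the slide."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current: Phase) -> Phase:
    """Return the phase that follows `current` in the listed order.

    Assumption: after Deployment the cycle restarts at Business
    Understanding.
    """
    return Phase(current.value % len(Phase) + 1)
```

For example, `next_phase(Phase.MODELING)` yields `Phase.EVALUATION`.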
![Page 5: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/5.jpg)
Why Update CRISP-DM?
Support for large-scale data mining
  a lot of distributed, heterogeneous and large datasets (primary data, derived data, background data, catalogs): from data to “space of data”
  data integration is of great importance
  new actors (domain expert, data analyst, data publisher, system administrator)
  support by new components (e.g. provenance)
  etc.
Our approach: from CRISP-DM to CRISP-DMI (Cross Research & Industry Standard Process for Data Mining and Integration)
![Page 6: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/6.jpg)
CRISP-DMI Model
![Page 7: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/7.jpg)
Space of Data and Services
Author: Ibrahim Elsayed
![Page 8: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/8.jpg)
TCM Workflow
![Page 9: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/9.jpg)
Subworkflow Targeted by Provenance
![Page 10: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/10.jpg)
Visualization of Provenance Data
Authors: Y. Han & F.A. Khan
![Page 11: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/11.jpg)
Use case
The fields in the data are:
Age:
Sex: M or F
BP: Blood pressure (High, Normal, or Low)
Cholesterol: Blood cholesterol level (Normal or High)
Na: Blood sodium concentration
K: Blood potassium concentration
Drug: The drug to which this patient responded
The business question: Can we find which drug is appropriate for any future patient?
(from P. Caron, C. Shearer, Interactive Visual Workflow: The Key to Streamlining the Data Mining Process)
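To make the record layout of the use case concrete, here is a minimal Python sketch. The class name `PatientRecord`, the field types, the upper-case value spellings, and the helper `majority_drug_by_bp` are all assumptions made for illustration. The helper is only a naive baseline (most frequent drug per blood-pressure level), not the mining method used in the cited use case.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    age: int
    sex: str          # "M" or "F"
    bp: str           # "HIGH", "NORMAL" or "LOW"
    cholesterol: str  # "NORMAL" or "HIGH"
    na: float         # blood sodium concentration
    k: float          # blood potassium concentration
    drug: str         # drug to which this patient responded


def majority_drug_by_bp(records):
    """Naive baseline: for each BP level, return the most frequent drug."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record.bp][record.drug] += 1
    return {bp: drugs.most_common(1)[0][0] for bp, drugs in counts.items()}
```

A decision tree or neural network, as in the workflow example later in the talk, would replace the baseline with a model over all fields.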
![Page 12: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/12.jpg)
DmiFlow: DMI Workflow Language
The emerging DMI applications create demand for a powerful DMI workflow language
Interactive GUIs can be developed on top of it
It should enable optimized implementation of language processors
![Page 13: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/13.jpg)
DMI Process to be Composed by DmiFlow
[Figure labels: Space of Source and Destination Data and Services; DMI Process; Composition]
![Page 14: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/14.jpg)
A Possible Position of DmiFlow in the Workflow Management Systems
[Figure labels: User; Visualisation; UML; High-level workflow composition; Textual representation; BPEL or other language; Sub-workflow for e2; System support; Feedback; User-relevant information flow; System-relevant information flow; workflow elements e1, e2, e3 and e21, e22, e23, e24]
![Page 15: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/15.jpg)
Principles for DMI Language Design
Programmer Responsibilities
  Identification of parallelism
  Specifying communication mode between workflow components
  Providing hints (sometimes based on domain knowledge) enabling advanced optimization
Language Desiderata
  High abstraction level, not too complex (high productivity)
  Advanced compositional features
  Execution of data mining queries (support for the inductive database model)
  Extendibility
  Efficient implementation (high performance)
![Page 16: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/16.jpg)
Related Work
Low-level workflow notations:
  XML-based: BPEL4WS, DSCL, WSFL, etc.
  Other: Scufl (Taverna), MoML (Kepler), etc.
High-level languages (only for workflows integrating business processes):
  Workflow Prolog
  Valmont: includes a process model, an information model, and an organization model (which registers organizational structure and resources)
  C & Co: a C-based language
  F#: functional workflow specification at a script level (Microsoft development)
  Martlet: functional workflow specification
Compositional languages (Strand, PCN, etc.)
![Page 17: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/17.jpg)
Workplan for the Language Design
Phase 1 (ongoing): proposing semantic structure and outlining compositional structure of programs while leaving open some aspects of their concrete representations as strings of symbols.
Phase 2: finalizing the 1st language definition version.
![Page 18: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/18.jpg)
Basic Features of DmiFlow
Code modules – managing complexity
Activities: their types, parameters, locations
Virtual communication channels between activities, which can be represented by
  Persistent explicit datasets
  Internal datasets (implementation dependent)
  Ports used for streaming data
Control structures: parallel and sequential statements, loop statements, conditional statements
Embedded data mining query execution
![Page 19: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/19.jpg)
Declaration of Activities and Datasets
activity activity_name: ActivityType at (activity_location);
ActivityType – predefined (type of parameters and semantics)
activity_location ∊ {url, discover, default} (this clause is optional)
dataset dataset_name represents (source = source_spec, hints_list);
source_spec ∊ {url, internal, port}
hint ∊ {org = dataset_organization, size = estimated_size, …}
dataset_organization ∊ {set, sequence, bag, …}
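The declaration forms above can be mirrored by two small record types. This is a minimal Python sketch of one possible in-memory representation, not the DmiFlow language processor. The class and method names are invented, the choice of `"default"` when no location is given is an assumption, and the check accepts only the three organizations named on the slide, although the slide leaves that set open.

```python
from dataclasses import dataclass, field

LOCATION_KEYWORDS = {"discover", "default"}   # anything else is read as a URL
SOURCE_KEYWORDS = {"internal", "port"}        # anything else is read as a URL
ORGANIZATIONS = {"set", "sequence", "bag"}    # assumption: the slide's set is open


@dataclass
class Activity:
    name: str
    activity_type: str            # a predefined ActivityType name
    location: str = "default"     # assumed fallback when the clause is omitted

    def is_url_location(self) -> bool:
        return self.location not in LOCATION_KEYWORDS


@dataclass
class Dataset:
    name: str
    source: str                                  # URL, "internal" or "port"
    hints: dict = field(default_factory=dict)    # e.g. {"org": "set", "size": ...}

    def __post_init__(self):
        org = self.hints.get("org")
        if org is not None and org not in ORGANIZATIONS:
            raise ValueError(f"unknown dataset organization: {org!r}")

    def is_persistent(self) -> bool:
        # Persistent explicit datasets are the ones addressed by a URL.
        return self.source not in SOURCE_KEYWORDS
```

Keeping hints in a plain dictionary reflects that they are advisory: a processor can ignore any hint it does not understand.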
![Page 20: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/20.jpg)
Basic Control Structures
Concurrent execution:
cobegin { activity1(…); … activityn(…);}
Sequential execution:
block { activity1(…); … activityn(…);}
Data mining query execution:
exec dmq (arguments) byactivity (activity_name){ dmq_query_specification}
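The intended behaviour of `cobegin` and `block` can be approximated with ordinary Python callables. This is a hedged sketch: the slides do not define the execution semantics, so it assumes that `cobegin` starts all branches concurrently and waits for every one of them, and that `block` runs its steps strictly in order. The Python function names simply reuse the DmiFlow keywords.

```python
from concurrent.futures import ThreadPoolExecutor


def block(*steps):
    """Return a callable that runs `steps` one after another."""
    def run():
        return [step() for step in steps]
    return run


def cobegin(*branches):
    """Return a callable that runs `branches` concurrently and waits for all.

    Assumption: cobegin joins every branch before control continues.
    """
    def run():
        with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
            futures = [pool.submit(branch) for branch in branches]
            # result() blocks until the branch finishes and re-raises errors.
            return [future.result() for future in futures]
    return run
```

Because both constructs return callables, they nest freely: `cobegin(block(a, b), c)` runs `a` then `b` in one branch while `c` runs in another.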
![Page 21: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/21.jpg)
Workflow Example – Graphical Form
![Page 22: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/22.jpg)
DmiFlow Example (1)
module WorkflowExample {
const replaceMethod = "average",
      splittingMethod = "gini", //hint
      url1 = "/serverA/dmi/services/integrationService1",
      url2 = "/serverB/dmi/services/decisionTreeService1",
      url3 = "/serverB/dmi/services/neuralNetworkService3";
activity integrDS: dataIntegrationActType at (url1),
         missVals: MissingValuesActType at (discover),
         normalise: NormalisForNNActType at (default),
         dt: decisionTreeActType at (url2),
         nn: NeuralNetworkActType at (url3);
dataset ….
![Page 23: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/23.jpg)
DmiFlow Example (2)
dataset ds1 represents (source = "http://www.myproject/d1.dat", org = set, size = [1.5, 2.0]),
        ds2 represents (source = "http://www.myproject/d2.dat", org = set),
        intConf represents (source = "/server/dmi/config/integr.conf"),
        outIntegr represents (source = internal, org = set),
        cleaned represents (source = internal, org = set),
        normalised represents (source = internal, org = set),
        nnConf represents (source = "/server/dmi/configs/nn.conf"),
        nnMod represents (source = "/server/dmi/models/nn.pmml"),
        dtMod represents (source = "/server/dmi/models/dt.pmml");
defworkflow { . . . }
![Page 24: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/24.jpg)
DmiFlow Example (3)
defworkflow main () {
  integrDS (in ds1, ds2, intConf; out outIntegr);
  missVals (in outIntegr, replaceMethod; out cleaned);
  cobegin {
    block {
      normalise (in cleaned; out normalised);
      nn (in normalised, nnConf; out nnMod);
    }
    dt (in cleaned, splittingMethod; out dtMod);
  }
}
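The data flow of this workflow can be traced with a short Python simulation. It is a sketch under stated assumptions, not an implementation of DmiFlow: every activity is a stand-in that only records which inputs it would consume, the activity names follow the declarations from part (1) of the example, the dataset contents `"d1"` and `"d2"` are placeholders, and `cobegin` is assumed to wait for both branches.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow():
    """Mimic the main workflow and return the resulting dataset store."""
    store = {"ds1": "d1", "ds2": "d2"}   # placeholder dataset contents

    def activity(name, inputs, output):
        # Stand-in activity: tags its output with its name and its inputs.
        args = ", ".join(store.get(i, i) for i in inputs)
        store[output] = f"{name}({args})"

    # Sequential prefix of the workflow.
    activity("integrDS", ["ds1", "ds2", "intConf"], "outIntegr")
    activity("missVals", ["outIntegr", "average"], "cleaned")

    def nn_branch():                     # block { normalise; nn }
        activity("normalise", ["cleaned"], "normalised")
        activity("nn", ["normalised", "nnConf"], "nnMod")

    def dt_branch():
        activity("dt", ["cleaned", "gini"], "dtMod")

    with ThreadPoolExecutor(max_workers=2) as pool:   # cobegin { ... }
        for future in [pool.submit(nn_branch), pool.submit(dt_branch)]:
            future.result()              # wait; re-raise branch errors
    return store
```

Both branches read only `cleaned` and write disjoint datasets, which is why they can run concurrently without coordination.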
![Page 25: Towards New Models and Languages for Data Mining and Integration](https://reader036.vdocument.in/reader036/viewer/2022062309/56814e90550346895dbc385f/html5/thumbnails/25.jpg)
Future Work
Extend language functionality
Investigate DmiFlow execution model for the ADMIRE architecture
Define functional specification of the DmiFlow language processor
Specify concrete language syntax