KDDML: A Middleware Language and System for Knowledge Discovery in
Databases
Dipartimento di Informatica, Università di Pisa
A. Romei, S. Ruggieri, F. Turini
Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005)
Brixen, Italy – 19-22 June, 2005
SEBD 2005 - Brixen, June 2005
Application Area: KDD
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful and understandable patterns in data.
The CRISP-DM process
Main focus on the automatic phases:
• Data pre-processing
• Modeling
• Post-processing
• Model evaluation
In this work
KDDML: an XML-based middleware language and system in support of the KDD process.
• KDDML as language.
• KDDML as system.
Requirements
R1: a data/models repository should be available for storing input, output and intermediate objects of the KDD process.
• Several representations of data can be available.
• Automatic format conversions.
• Automatic meta-data mapping (e.g., ARFF, SQL).
R2: specification of logical meta-data (meta-model) in addition to the physical data (model).
R3: compositionality of mining operations in the design of the language (closure principle).
R4: high extensibility of the system architecture.
KDDML as XML-based System
XML as data/model representation (R1, R2).
Machine-processable language.
XML as language definition.
Ensures compositionality of operators (R3).
Extensibility and modularity (R4).
Data Format
Separating the logical data from the physical instances.
• Data schema via proprietary XML.
• Actual data stored in CSV (Comma Separated Values).
CSV was chosen as a trade-off: it is more readable than a binary file and more compact than XML.
Data Format: Example
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
      <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
    </ATTRIBUTE>
    <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
      <NOMINAL_DESCRIPTION number_of_values="4">
        <VALUE value="HS-grad" cardinality="3"/>
        <VALUE value="masters" cardinality="2"/>
        ….
      </NOMINAL_DESCRIPTION>
    </ATTRIBUTE>
    ….
  </SCHEMA>
</KDDML_TABLE>
Slide callouts: Logical Metadata (the SCHEMA element); Physical Data (the file named by data_file).
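The split between logical metadata and physical data can be illustrated with a minimal Python sketch. It is not part of KDDML: the schema is an abridged copy of the example, the CSV content is made up, the helper names `read_schema` and `load_rows` are hypothetical, and the use of "?" as the missing-value marker is an assumption.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Abridged, hypothetical copy of the census schema from the example.
SCHEMA_XML = """
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="2" number_of_instances="3">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric"/>
    <ATTRIBUTE name="education" number_of_missed_values="1" type="nominal"/>
  </SCHEMA>
</KDDML_TABLE>
"""

# Made-up rows standing in for census.csv; "?" as missing marker is an assumption.
CSV_TEXT = "18,HS-grad\n40,masters\n70,?\n"


def read_schema(xml_text):
    """Return (data_file, [(attribute name, attribute type), ...])."""
    root = ET.fromstring(xml_text)
    attrs = [(a.get("name"), a.get("type")) for a in root.iter("ATTRIBUTE")]
    return root.get("data_file"), attrs


def load_rows(csv_text, attrs):
    """Turn each CSV record into a dict, casting values by their logical type."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        row = {}
        for (name, kind), raw in zip(attrs, record):
            if raw == "?":
                row[name] = None          # missing value
            elif kind == "numeric":
                row[name] = float(raw)    # numeric attributes become floats
            else:
                row[name] = raw           # nominal attributes stay strings
        rows.append(row)
    return rows


data_file, attrs = read_schema(SCHEMA_XML)
rows = load_rows(CSV_TEXT, attrs)
```

The CSV carries no type information of its own; everything needed to interpret a column comes from the XML schema.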
Model Format
PMML (Predictive Model Markup Language)
• An industry standard for representing actual models as XML documents.
• Consists of DTDs for a wide spectrum of models, including association rules (RdA), decision trees, clustering, regression and neural networks.
• It does not cover the process of extracting models, only the exchange of extracted knowledge.
Model Format: Example
<PMML version="2.0">
  ….
  <DataDictionary>
    <DataField name="id" optype="continuous"/>
    …
    <DataField name="amount" optype="continuous"/>
  </DataDictionary>
  <TreeModel modelName="censusTree" splitCharacteristic="multiSplit">
    <MiningSchema>
      <MiningField name="id" usageType="supplementary"/>
      …
      <MiningField name="class" usageType="predicted"/>
    </MiningSchema>
    <Node score="" recordCount="48842">
      <True/>
      <ScoreDistribution value="&lt;=50K" recordCount="37155"/>
      ...
    </Node>
  </TreeModel>
</PMML>
Slide callouts: Logical Metadata; Physical Model.
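Because PMML is plain XML, both the logical metadata and the model itself can be read with a generic XML parser. The following Python sketch works on an abridged copy of the example (elided parts are simply left out, nothing is filled in); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the PMML example; the elided fields and nodes are omitted.
PMML_TEXT = """
<PMML version="2.0">
  <DataDictionary>
    <DataField name="id" optype="continuous"/>
    <DataField name="amount" optype="continuous"/>
  </DataDictionary>
  <TreeModel modelName="censusTree" splitCharacteristic="multiSplit">
    <MiningSchema>
      <MiningField name="id" usageType="supplementary"/>
      <MiningField name="class" usageType="predicted"/>
    </MiningSchema>
    <Node score="" recordCount="48842">
      <True/>
      <ScoreDistribution value="&lt;=50K" recordCount="37155"/>
    </Node>
  </TreeModel>
</PMML>
"""

root = ET.fromstring(PMML_TEXT)

# Logical metadata: declared fields and the field the tree predicts.
fields = {f.get("name"): f.get("optype") for f in root.iter("DataField")}
predicted = [m.get("name") for m in root.iter("MiningField")
             if m.get("usageType") == "predicted"]

# Physical model: class distribution stored at the root node of the tree.
distribution = {s.get("value"): int(s.get("recordCount"))
                for s in root.iter("ScoreDistribution")}
```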
Closure Principle (1)
Arguments of an operator must be of an appropriate type and in an appropriate sequence.
We express the signature of an operator $op: t_1 \times \dots \times t_n \to t$ by defining a DTD for KDDML queries that constrains the sub-elements of op to be of types $t_1, \dots, t_n$.
Closure Principle (2)
$f_{TREE\_CLASSIFY}: tree \times table \to table$
<!ELEMENT TREE_CLASSIFY ((%kdd_query_trees;), (%kdd_query_table;))>
<!ATTLIST TREE_CLASSIFY xml_dest %string; #IMPLIED>
Where:
• kdd_query_trees: all operators returning a classification tree;
• kdd_query_table: all operators returning a table;
• TREE_CLASSIFY belongs to the kdd_query_table entity.
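In KDDML the DTD itself enforces the closure principle at validation time. The same check can be sketched by hand in Python to make the idea concrete. Only the TREE_CLASSIFY signature (tree x table -> table) comes from the slide; the other table entries and the `result_type` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical signature table: operator -> (argument types, result type).
# Only TREE_CLASSIFY is stated in the source; the rest is assumed.
SIGNATURES = {
    "TREE_CLASSIFY": (("tree", "table"), "table"),
    "TREE_MINER": (("table", "alg"), "tree"),
    "TABLE_LOADER": ((), "table"),
    "ARFF_LOADER": ((), "table"),
    "ALGORITHM": (None, "alg"),  # None: PARAM children are not typed arguments
}


def result_type(element):
    """Return the type an operator element produces, checking its arguments."""
    arg_types, result = SIGNATURES[element.tag]
    if arg_types is None:
        return result
    actual = tuple(result_type(child) for child in element)
    if actual != arg_types:
        raise TypeError(f"{element.tag}: expected {arg_types}, got {actual}")
    return result


good = ET.fromstring(
    "<TREE_CLASSIFY>"
    "<TREE_MINER><ARFF_LOADER/><ALGORITHM/></TREE_MINER>"
    "<TABLE_LOADER/>"
    "</TREE_CLASSIFY>"
)
# Same operators, but the arguments of TREE_CLASSIFY are swapped.
bad = ET.fromstring(
    "<TREE_CLASSIFY>"
    "<TABLE_LOADER/>"
    "<TREE_MINER><ARFF_LOADER/><ALGORITHM/></TREE_MINER>"
    "</TREE_CLASSIFY>"
)

good_type = result_type(good)
try:
    result_type(bad)
    bad_rejected = False
except TypeError:
    bad_rejected = True
```

Since every operator returns a value of a known type, any operator whose result type matches can be nested as an argument, which is what makes queries compositional.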
KDDML Types
The set of types of KDDML operators consists of:
• Table, PPtable
• Tree, clusters, rda, sequence, hierarchy
• Algs, condition, expression
KDDML Query structure
The structure of a KDDML query has a precise format.
• XML element tags correspond to operations on data and models;
• XML attributes correspond to parameters of those operations;
• XML sub-elements define the arguments passed to the operators (KDDML Types).
<OPERATOR_NAME xml_dest="results.xml" att1="v1" ... attM="vM">
  <ARG1_NAME> .... </ARG1_NAME>
  ...
  <ARGn_NAME> .... </ARGn_NAME>
</OPERATOR_NAME>
Example (1)
Construction and application of a decision tree.
• Loading of an ARFF source as training set.
• Simple sampling on the training set.
• Construction of a decision tree on the sampled training set.
  Target attribute: play. Algorithm: C4.5.
• Loading of a test set from the system repository.
• Application of the decision tree on the test set.
Example (2)
<KDDML_OBJECT>
  <KDD_QUERY name="sample">
    <TREE_CLASSIFY xml_dest="results.xml">
      <TREE_MINER xml_dest="weather.xml" target_attribute="play">
        <PP_SAMPLING>
          <ARFF_LOADER arff_file_name="weather.arff"/>
          <ALGORITHM algorithm_name="simple_sampling">
            <PARAM name="percentage" value="0.66"/>
          </ALGORITHM>
        </PP_SAMPLING>
        <ALGORITHM algorithm_name="C4.5">
          <PARAM name="confidence_for_pruning" value="0.4"/>
          <PARAM name="num_instances_for_leaf" value="6"/>
        </ALGORITHM>
      </TREE_MINER>
      <TABLE_LOADER xml_source="weather_test.xml"/>
    </TREE_CLASSIFY>
  </KDD_QUERY>
</KDDML_OBJECT>
Operator tree of the query:
Tree Classify
  Tree Miner (Alg: c4.5; pruning confidence: 40%; num instances: 6)
    Sampling (Alg: simple sampling; percentage: 66%)
      Arff Loader (Source: weather.arff; Repository: ARFF)
  Table Loader (Source: weather_test.xml; Repository: Data)
Language Operators
• Data/Model access.
• Preprocessing: Data Cleaning, Sampling, Normalization, Discretization.
• Model extraction.
• Model application and evaluation.
• Model meta-reasoning & filtering.
Example one: Discretization
....
<PP_NUMERIC_DISCRETIZATION xml_dest="census_discrete.xml"
                           attribute_name="age"
                           label_type="enumeration"
                           enumerated_label_list="young, middle, old">
  <TABLE_LOADER xml_source="census.xml"/>
  <ALGORITHM algorithm_name="natural_binning">
    <PARAM name="cardinality" value="3"/>
    <PARAM name="having_number_of_intervals" value="true"/>
  </ALGORITHM>
</PP_NUMERIC_DISCRETIZATION>
....
Discretization of the numeric attribute “age” into three intervals, labelled young, middle and old, using the natural binning method.
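The slides do not define natural binning. The Python sketch below assumes it means equal-width intervals over the observed range of the attribute, which may differ from what KDDML actually does; the function name `equal_width_labels` and the sample ages are made up for illustration.

```python
def equal_width_labels(values, labels):
    """Assign each value the label of its equal-width interval.

    Assumption: "natural binning" is taken to mean equal-width binning
    between the minimum and maximum observed values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for value in values:
        index = int((value - lo) / width) if width > 0 else 0
        # The maximum value falls on the upper edge; clamp it into the last bin.
        result.append(labels[min(index, len(labels) - 1)])
    return result


# Made-up ages spanning the min (18.0) and max (70.0) of the census schema.
ages = [18.0, 30.0, 40.0, 60.0, 70.0]
discretized = equal_width_labels(ages, ["young", "middle", "old"])
```

With three labels and a range of 18 to 70, each interval is about 17.3 years wide.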
Example two: RdA filtering
....
<RDA_FILTER>
  <RDA_LOADER xml_source="rules.xml"/>
  <CONDITION>
    <AND_COND>
      <BASE_COND op_type="is_in" term1="@body" term2="bread"/>
      <BASE_COND op_type="is_not_in" term1="@head" term2="milk"/>
      <BASE_COND op_type="equal" term1="@head_cardinality" term2="2"/>
      <BASE_COND op_type="greater" term1="@support" term2="0.3"/>
    </AND_COND>
  </CONDITION>
</RDA_FILTER>
....
Selects the rules that have the item “bread” in the body, do not have the item “milk” in the head, have exactly two items in the head, and have support greater than 30%.
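The semantics of this filter can be mimicked in a few lines of Python. This is a sketch, not the KDDML implementation: the `Rule` class, the helper functions and the sample rules are hypothetical, and only the four conditions are taken from the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory association rule: body => head."""
    body: frozenset
    head: frozenset
    support: float


# The four BASE_COND elements of the query, as (op_type, term1, term2).
CONDITIONS = [
    ("is_in", "@body", "bread"),
    ("is_not_in", "@head", "milk"),
    ("equal", "@head_cardinality", "2"),
    ("greater", "@support", "0.3"),
]


def resolve(term, rule):
    """Map an @-term to the corresponding property of the rule."""
    if term == "@body":
        return rule.body
    if term == "@head":
        return rule.head
    if term == "@head_cardinality":
        return len(rule.head)
    if term == "@support":
        return rule.support
    raise ValueError(f"unknown term {term}")


def base_cond(op_type, term1, term2, rule):
    """Evaluate one base condition against a rule."""
    left = resolve(term1, rule)
    if op_type == "is_in":
        return term2 in left
    if op_type == "is_not_in":
        return term2 not in left
    if op_type == "equal":
        return left == float(term2)
    if op_type == "greater":
        return left > float(term2)
    raise ValueError(f"unknown op_type {op_type}")


def rda_filter(rules, conditions):
    """AND_COND: keep the rules that satisfy every base condition."""
    return [r for r in rules if all(base_cond(*c, r) for c in conditions)]


# Made-up rules for illustration.
rules = [
    Rule(frozenset({"bread"}), frozenset({"butter", "jam"}), 0.4),  # kept
    Rule(frozenset({"bread"}), frozenset({"milk", "jam"}), 0.5),    # milk in head
    Rule(frozenset({"bread"}), frozenset({"butter"}), 0.6),         # one head item
    Rule(frozenset({"beer"}), frozenset({"chips", "nuts"}), 0.5),   # no bread
    Rule(frozenset({"bread"}), frozenset({"butter", "jam"}), 0.2),  # low support
]
selected = rda_filter(rules, CONDITIONS)
```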
Design targets
• Extensibility: data sources, algorithms, models.
• Portability.
• Modularity.
Architecture structured in 3 layers.
Architecture Layers
Diagram: three stacked layers. From top to bottom: Interpreter Layer (connected to the upper layers), Operators Layer, Repository Layer (on top of data and models).
Operators Layer:
• Implementation of language operators.
• Each <OPERATOR_NAME> is implemented as a Java class satisfying an interface.
• The interface is task-dependent.
Repository Layer:
• Manages read/write access to the data and models repository.
• Manages read/write access to data and models from external sources.
• Provides programmatic access for the higher layers.
Interpreter Layer:
• Accepts a validated KDDML query and returns the result as an XML document.
• Recursively traverses the DOM tree representation of the query.
• The interpreter is not affected by extensions to data sources, algorithms or models.
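The interplay between the interpreter and the operators layer can be sketched as a recursive evaluator over the query tree. The sketch is in Python rather than Java, uses ElementTree in place of a DOM, and replaces the repository with an in-memory dictionary; the registry, the `operator` decorator and the `TOY_CONCAT` operator are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for the repository layer: tables kept in memory (made-up data).
REPOSITORY = {"a.xml": [[1], [2]], "b.xml": [[3]]}

# Stand-in for the operators layer: tag name -> implementation.
OPERATORS = {}


def operator(tag):
    """Register a function as the implementation of an operator tag."""
    def register(func):
        OPERATORS[tag] = func
        return func
    return register


@operator("TABLE_LOADER")
def table_loader(attrs, args):
    # Toy behaviour: fetch a table from the in-memory repository.
    return REPOSITORY[attrs["xml_source"]]


@operator("TOY_CONCAT")  # hypothetical operator, not part of KDDML
def toy_concat(attrs, args):
    # Concatenate the rows of all argument tables.
    return [row for table in args for row in table]


def interpret(element):
    """Evaluate the sub-elements first, then apply the operator to the results."""
    args = [interpret(child) for child in element]
    return OPERATORS[element.tag](dict(element.attrib), args)


query = ET.fromstring(
    '<TOY_CONCAT>'
    '<TABLE_LOADER xml_source="a.xml"/>'
    '<TABLE_LOADER xml_source="b.xml"/>'
    '</TOY_CONCAT>'
)
result = interpret(query)
```

Adding an operator only adds an entry to the registry; `interpret` stays unchanged, which mirrors the claim that the interpreter is unaffected by extensions.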
KDDML as Middleware System
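Diagram: KDDML queries reach the Interpreter Layer from two front ends: a compiler that translates queries written in the high-level language MQL, and a GUI. The Interpreter Layer sits on the Operators Layer and the Repository Layer (data and models); results flow back to the front ends.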
ClickWorld
Extract DM models from visits to a city-news portal, with the intent to characterize the topics of interest of new visitors.
M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini. Preprocessing and mining web log data for web personalization. 8th Italian Conf. on Artificial Intelligence, Vol. 2829 of LNCS, pp. 237-249, September 2003.
KDDML-G
(Diagram: operators OP, OP1, OP2, OP3.)
• A system for KDD on the GRID.
• Exploits the parallelism offered by the GRID.
• Data immovability: the code is moved to where the data resides.