Weka4WS: Enabling Distributed Data Mining on Grids

Domenico Talia, Paolo Trunfio, Oreste Verta
[email protected]
DEIS, University of Calabria, Rende, Italy
Venezia, October 17-18, 2005



Page 2: Weka4WS : Enabling Distributed Data Mining on Grids

Outline

• Goal of this work
• Weka4WS, WSRF and GT4
• Weka4WS architecture
• Software components
• Graphical user interface
• Web Services operations
• Execution mechanisms
• Performance analysis
• Conclusions

Page 3: Weka4WS : Enabling Distributed Data Mining on Grids

Goal of this work

• Computational Grids have emerged in recent years as effective infrastructures for distributed high-performance computing and data processing
• By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and manage data and knowledge distribution
• The goal of this work is to extend the Weka toolkit to support distributed data mining through standard Grid technologies, such as the emerging Web Services Resource Framework (WSRF) and the Globus Toolkit 4

Page 4: Weka4WS : Enabling Distributed Data Mining on Grids

The Weka4WS framework

• Weka provides a large collection of machine learning algorithms for data pre-processing, classification, clustering, association rules, and visualization, which can be used through a common GUI
• In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally
• The goal of Weka4WS is to extend Weka to support remote execution of Weka data mining algorithms

Page 5: Weka4WS : Enabling Distributed Data Mining on Grids

The Weka4WS framework

• In Weka4WS the data mining algorithms for classification, clustering and association rules can be executed on remote Grid resources to implement distributed data mining and speed up applications
• To enable remote invocation, each data mining algorithm provided by the Weka library is exposed as a Web Service, which can be easily deployed on the available Grid nodes
• To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed and developed using the emerging Web Services Resource Framework (WSRF) as its enabling technology

Page 6: Weka4WS : Enabling Distributed Data Mining on Grids

WSRF and Globus Toolkit 4

• WSRF is a family of technical specifications concerned with the creation, addressing, inspection, and lifetime management of stateful resources
• The framework codifies the relationship between Web Services and stateful resources, which are termed WS-Resources in the WSRF language
• The Globus Alliance recently released the Globus Toolkit 4 (GT4), which provides an open source implementation of the WSRF library and incorporates Web Services implemented according to the WSRF specifications
• Weka4WS has been developed using the Java WSRF library provided by GT4


Page 8: Weka4WS : Enabling Distributed Data Mining on Grids

Weka4WS architecture

• In Weka4WS all nodes use GT4 services for standard Grid functionalities, such as security and data management
• Nodes fall into two categories:
  • user nodes, which are the local machines of the users and provide the Weka4WS client software
  • computing nodes, which provide the Weka4WS Web Services that allow the execution of remote data mining tasks
• Data can be located on computing nodes, user nodes, or third-party nodes
• If the dataset to be mined is not locally available on a computing node, it can be downloaded or replicated by means of the GT4 data management services

Page 9: Weka4WS : Enabling Distributed Data Mining on Grids

Software components

• User nodes include three software components:
  • Graphical User Interface (GUI)
  • Client Module (CM)
  • Weka Library (WL)

[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) connected to a computing node (Web Service, Weka Library, GT4 Services)]

Page 10: Weka4WS : Enabling Distributed Data Mining on Grids

Software components

• Computing nodes include two software components:
  • Web Service (WS)
  • Weka Library (WL)

[Figure: same user node and computing node diagram as on slide 9]

Page 11: Weka4WS : Enabling Distributed Data Mining on Grids

Software components

• The GUI extends the Weka Explorer environment to allow the execution of both local and remote data mining tasks:
  • local tasks are executed by directly invoking the local WL
  • remote tasks are executed through the CM, which operates as an intermediary between the GUI and Web Services on remote computing nodes

[Figure: same user node and computing node diagram as on slide 9, with a "local task" arrow from the GUI to the local Weka Library and a "remote task" arrow from the GUI through the Client Module to the Web Service]

Page 12: Weka4WS : Enabling Distributed Data Mining on Grids

Software components

• The WS is a WSRF-compliant Web Service that exposes the data mining algorithms provided by the underlying WL
• Therefore, requests to the WS are executed by invoking the corresponding WL algorithms

[Figure: same user node and computing node diagram as on slide 9, with an "algorithm invocation" arrow from the Web Service to the Weka Library on the computing node]

Page 13: Weka4WS : Enabling Distributed Data Mining on Grids

Graphical user interface

• A “Remote pane” has been added to the original Weka Explorer environment

Page 14: Weka4WS : Enabling Distributed Data Mining on Grids

Graphical user interface

• This pane provides a list of the remote Web Services that can be invoked, and two buttons to start and stop the data mining task on the selected Web Service

Page 15: Weka4WS : Enabling Distributed Data Mining on Grids

Graphical user interface

• Through the GUI a user can start the execution either:
  • locally, by using the standard Local pane
  • remotely, by using the Remote pane
• Each task in the GUI is managed by an independent thread
• Therefore, a user can start multiple data mining tasks in parallel on different Web Services, taking full advantage of the distributed Grid environment
• As soon as the output of a data mining task has been received from a remote computing node, it is displayed in the standard Output pane
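The one-thread-per-task idea above can be sketched in a few lines of Python. This is only an illustration, not the Weka4WS client code: `run_remote_task`, `run_tasks_in_parallel`, and the two service URLs are hypothetical, and the remote call is replaced by a placeholder string.

```python
from concurrent.futures import ThreadPoolExecutor


def run_remote_task(service_url, algorithm):
    # Placeholder for a remote invocation; a real client would contact
    # the Web Service at service_url (hypothetical URL, hypothetical helper).
    return f"{algorithm} completed on {service_url}"


def run_tasks_in_parallel(tasks):
    # One worker thread per task, mirroring "each task is managed by an
    # independent thread"; results come back in submission order.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(run_remote_task, url, algorithm)
                   for url, algorithm in tasks]
        return [future.result() for future in futures]


tasks = [
    ("https://node1.example.org/Weka4WS", "weka.classifiers.trees.J48"),
    ("https://node2.example.org/Weka4WS", "weka.clusterers.EM"),
]
outputs = run_tasks_in_parallel(tasks)
```

Because each task runs in its own thread, a slow computing node delays only its own result and not the others.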


Page 17: Weka4WS : Enabling Distributed Data Mining on Grids

Web Services operations

• The first three operations are related to WSRF-specific invocation mechanisms
• The last three operations are used to request the execution of a specific data mining task

Category       Operation         Description
WSRF-specific  createResource    Creates a new WS-Resource.
               subscribe         Subscribes to notifications about resource property changes.
               destroy           Explicitly requests the destruction of a WS-Resource.
Data mining    classification    Submits the execution of a classification task.
               clustering        Submits the execution of a clustering task.
               associationRules  Submits the execution of an association rules task.

Page 18: Weka4WS : Enabling Distributed Data Mining on Grids

Web Services operations

• The classification operation provides access to the complete set of classifiers in the Weka Library (currently, 71 algorithms)
• The clustering and associationRules operations expose all the clustering and association rules algorithms provided by the Weka Library (5 and 2 algorithms, respectively)
• To improve concurrency, the data mining operations are invoked asynchronously:
  • the client submits the execution in non-blocking mode, and results are delivered to the client as soon as they have been computed
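A minimal Python sketch of this non-blocking pattern, under stated assumptions: `submit_clustering` and its `deliver` callback are hypothetical names, the mining step is a placeholder string, and a local thread stands in for the remote computing node.

```python
import threading


def submit_clustering(algorithm, data_set, deliver):
    """Start the task in the background and return immediately (hypothetical helper)."""
    def work():
        # Placeholder for the remote computation.
        model = f"model built by {algorithm} from {data_set}"
        deliver(model)  # notify the client once the result exists

    worker = threading.Thread(target=work)
    worker.start()
    return worker


results = []
worker = submit_clustering("weka.clusterers.EM",
                           "gsiftp://hostname/path/file.arff",
                           results.append)
# The caller is not blocked at this point and could submit further tasks.
worker.join()  # only so that this example finishes deterministically
```

The client never polls for the result; it simply supplies the function to be called when the result is ready.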

Page 19: Weka4WS : Enabling Distributed Data Mining on Grids

Input parameters of the DM operations

• Three parameters are required in the invocation of all the data mining operations: algorithm, arguments, and dataSet

Operation         Parameter      Description
classification    algorithm      Name of the classification algorithm
                  arguments      Arguments to be passed to the algorithm
                  testOptions    Options to be used during the testing phase
                  classIndex     Index of the attribute to use as the class
                  dataSet        URL of the dataset to be mined
clustering        algorithm      Name of the clustering algorithm
                  arguments      Algorithm arguments
                  testOptions    Testing phase options
                  selectedAttrs  Indexes of the selected attributes
                  classIndex     Index of the class with respect to which clusters are evaluated
                  dataSet        URL of the dataset to be mined
associationRules  algorithm      Name of the association rules algorithm
                  arguments      Algorithm arguments
                  dataSet        URL of the dataset to be mined

Page 20: Weka4WS : Enabling Distributed Data Mining on Grids

Input parameters of the DM operations

• The algorithm parameter specifies the name of the Java class in the Weka Library to be invoked
  • example: “weka.classifiers.trees.J48”

[Table: same parameter table as on slide 19]

Page 21: Weka4WS : Enabling Distributed Data Mining on Grids

Input parameters of the DM operations

• The arguments parameter specifies a sequence of arguments to be passed to the algorithm
  • example: “-C 0.25 -M 2”

[Table: same parameter table as on slide 19]

Page 22: Weka4WS : Enabling Distributed Data Mining on Grids

Input parameters of the DM operations

• The dataSet parameter specifies the URL of the dataset to be mined
  • example: “gsiftp://hostname/path/file.arff”

[Table: same parameter table as on slide 19]
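The three common parameters can be pictured as a small request record. The Python sketch below is illustrative only: `MiningRequest` is a hypothetical type, not part of Weka4WS, and the values are the examples given on the slides.

```python
import shlex
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class MiningRequest:
    """Hypothetical container for the parameters shared by all DM operations."""
    algorithm: str  # name of the Java class in the Weka Library
    arguments: str  # option string passed to the algorithm
    data_set: str   # URL of the dataset to be mined

    def argument_list(self):
        # Split the option string into individual tokens.
        return shlex.split(self.arguments)

    def data_set_scheme(self):
        # Transfer protocol named in the dataset URL.
        return urlparse(self.data_set).scheme


request = MiningRequest(
    algorithm="weka.classifiers.trees.J48",
    arguments="-C 0.25 -M 2",
    data_set="gsiftp://hostname/path/file.arff",
)
tokens = request.argument_list()
scheme = request.data_set_scheme()
```

Here `tokens` holds the four option tokens and `scheme` is `gsiftp`, the GridFTP URL scheme used in the slide's example.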

Page 23: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution steps

[Figure: interaction diagram. On the user node: the Client Module (with its deliver operation), the dataset, and a GridFTP server. On the computing node: the Web Service (createResource, subscribe, destroy, and clustering operations), the Weka Library, the GT4 RFT Service, and a GridFTP server]

• Scenario: the Client Module (CM) requests the execution of a clustering analysis on a dataset local to the user node
• To perform this task, a sequence of steps needs to be executed

Page 24: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Resource creation

[Figure: same interaction diagram as on slide 23, now showing a WS-Resource with a clustering model property on the computing node; step 1 is marked]

1. Resource creation. The CM invokes the createResource operation to create a new WS-Resource. A clustering model property of this resource is used to store the result of the clustering task. The WS returns the endpoint reference (EPR) of the created resource

Page 25: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Notification subscription

[Figure: same interaction diagram as on slide 23, with steps 1 and 2 marked]

2. Notification subscription. The CM invokes the subscribe operation, which subscribes to notifications about changes that will occur to the clustering model resource property

Page 26: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Task submission

[Figure: same interaction diagram as on slide 23, with steps 1 to 3 marked]

3. Task submission. The CM invokes the clustering operation to request the execution of the clustering task. The operation is invoked asynchronously

Page 27: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Dataset download

[Figure: same interaction diagram as on slide 23, with steps 1 to 4 marked; a copy of the dataset appears on the computing node]

4. Dataset download. The WS requests the download of the dataset to be mined from the URL specified in the clustering invocation. The download request is managed by the GT4 Reliable File Transfer (RFT) service

Page 28: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Data mining

[Figure: same interaction diagram as on slide 23, with steps 1 to 5 marked]

5. Data mining. The clustering analysis is started by invoking the appropriate Java class in the WL. The execution is handled within the WS-Resource created in Step 1, and the result of the computation is stored in the clustering model property

Page 29: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Results notification

[Figure: same interaction diagram as on slide 23, with steps 1 to 6 marked]

6. Results notification. Whenever the clustering model property changes, its new value is sent to the CM by invoking the CM's implicit deliver operation

Page 30: Weka4WS : Enabling Distributed Data Mining on Grids

Task execution: Resource destruction

[Figure: same interaction diagram as on slide 23, with steps 1 to 7 marked]

7. Resource destruction. The CM invokes the destroy operation, which explicitly destroys the WS-Resource created in Step 1
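The seven steps can be traced end to end with a small in-memory simulation in Python. Everything here is an assumption made for illustration: `ClusteringService` is a hypothetical stand-in for the Web Service, not the Weka4WS or GT4 API; the download and mining steps are placeholder strings; the call is synchronous, whereas Weka4WS submits it asynchronously; and "-N 10" is assumed to be the EM option for the number of clusters.

```python
import itertools


class ClusteringService:
    """In-memory stand-in for the Web Service on a computing node (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.resources = {}    # EPR -> resource properties
        self.subscribers = {}  # EPR -> callbacks registered via subscribe

    def create_resource(self):
        # Step 1: create a WS-Resource and return its endpoint reference.
        epr = f"resource-{next(self._ids)}"
        self.resources[epr] = {"clusteringModel": None}
        self.subscribers[epr] = []
        return epr

    def subscribe(self, epr, deliver):
        # Step 2: register interest in changes to the clustering model.
        self.subscribers[epr].append(deliver)

    def clustering(self, epr, algorithm, arguments, data_set):
        # Step 3: task submission (synchronous in this sketch).
        dataset = f"contents of {data_set}"              # Step 4: placeholder download
        model = f"{algorithm} {arguments} on {dataset}"  # Step 5: placeholder mining
        self.resources[epr]["clusteringModel"] = model
        for deliver in self.subscribers[epr]:            # Step 6: notify subscribers
            deliver(model)

    def destroy(self, epr):
        # Step 7: explicitly destroy the WS-Resource.
        del self.resources[epr]
        del self.subscribers[epr]


service = ClusteringService()
delivered = []  # plays the role of the Client Module's deliver operation

epr = service.create_resource()
service.subscribe(epr, delivered.append)
service.clustering(epr, "weka.clusterers.EM", "-N 10",
                   "gsiftp://hostname/path/file.arff")
service.destroy(epr)
```

After the last call, `delivered` holds the single model string and the service no longer tracks the resource, which mirrors the state left behind by step 7.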


Page 32: Weka4WS : Enabling Distributed Data Mining on Grids

Performance analysis

• Goals:
  • evaluating the execution times of the different steps needed to perform a typical data mining task in different network scenarios
  • evaluating the efficiency of the WSRF mechanisms and Weka4WS as methods to execute distributed data mining services
• We used 10 datasets extracted from the census dataset available at the UCI repository:
  • number of instances: from 1700 to 17000
  • dataset size: from 0.5 to 5 MB
• Weka4WS was used to perform a clustering analysis on each of these datasets:
  • algorithm used: Expectation Maximization (EM)
  • number of clusters to be identified: 10

Page 33: Weka4WS : Enabling Distributed Data Mining on Grids

Performance analysis

• The clustering analysis on each dataset was executed in two network scenarios:
  • LAG: the computing node and the user node are connected by a local area grid (Bw = 94.4 Mbps, RTT = 1.4 ms)
  • WAG: the computing node and the user node are connected by a wide area grid (Bw = 213 kbps, RTT = 19 ms)
• For each dataset size and network scenario we ran 20 independent executions:
  • the values reported in the following graphs are averages of the values measured in the 20 executions

Page 34: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - LAG scenario

[Figure: execution time (ms, logarithmic scale from 1.0E+02 to 1.0E+07) versus dataset size (MB, from 0.5 to 5.0 in steps of 0.5), with one line each for resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and total. Data labels by increasing dataset size: 1.12E+05, 2.12E+05, 3.11E+05, 4.16E+05, 5.19E+05, 6.25E+05, 7.26E+05, 8.30E+05, 9.24E+05, 1.03E+06]

• This graph shows the execution times of the different steps of the clustering task in the LAG scenario for dataset sizes ranging from 0.5 to 5 MB

Page 35: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - LAG scenario

• The execution times of the WSRF-specific steps are independent of the dataset size:
  • resource creation (1698 ms on average), notification subscription (275 ms), task submission (342 ms), results notification (1354 ms), and resource destruction (214 ms)

[Figure: same LAG execution time chart as on slide 34]

Page 36: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - LAG scenario

• By contrast, the execution times of the dataset download and data mining steps increase with the dataset size:
  • Dataset download: 218 ms (0.5 MB) ... 665 ms (5 MB)
  • Data mining: 107474 ms (0.5 MB) ... 1026584 ms (5 MB)

[Figure: same LAG execution time chart as on slide 34, with the dataset download and data mining lines highlighted]

Page 37: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - LAG scenario

• The total execution time ranges from 111798 ms for the 0.5 MB dataset to 1031209 ms for the 5 MB dataset
• The lines representing the total execution time and the data mining execution time appear to coincide, because the data mining step takes from 96% to 99% of the total execution time

[Figure: same LAG execution time chart as on slide 34, with the total time line highlighted]

Page 38: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - WAG scenario

• The execution times of the WSRF-specific steps are similar to those measured in the LAG scenario
• The only significant difference is the execution time of the results notification step (2790 ms), due to the additional time needed to transfer the clustering model through a low-speed network

[Figure: execution time (ms, logarithmic scale from 1.0E+02 to 1.0E+07) versus dataset size (MB, from 0.5 to 5.0 in steps of 0.5) in the WAG scenario, with one line each for resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and total. Data labels by increasing dataset size: 1.30E+05, 2.46E+05, 3.61E+05, 4.86E+05, 6.16E+05, 7.25E+05, 8.49E+05, 9.64E+05, 1.06E+06, 1.18E+06]

Page 39: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times - WAG scenario

• For the same reason, the transfer of the dataset to be mined requires an execution time significantly greater than the one measured in the LAG scenario:
  • Dataset download: 14638 ms (0.5 MB) ... 132463 ms (5 MB)

[Figure: same WAG execution time chart as on slide 38]

Page 40: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times percentage - LAG scenario

• This graph shows the percentage of the execution times of the data mining, dataset download, and the other steps (i.e., resource creation, notification subscription, task submission, results notification, resource destruction), with respect to the total execution time in the LAG scenario

[Figure: percentage of the total execution time (0 to 100) versus dataset size (MB, from 0.5 to 5.0 in steps of 0.5), with series for other steps, dataset download, and data mining. Data labels by increasing dataset size: 96.13%, 98.02%, 98.71%, 99.02%, 99.21%, 99.32%, 99.40%, 99.48%, 99.52%, 99.55%]

Page 41: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times percentage - LAG scenario

• In the LAG scenario the data mining step represents from 96.13% to 99.55% of the total execution time, the dataset download ranges from 0.19% to 0.06%, and the other steps range from 3.67% to 0.38%

[Figure: same LAG percentage chart as on slide 40]
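These LAG percentages follow directly from the times reported on slides 36 and 37. The short Python check below recomputes them; the `shares` helper is a hypothetical name introduced here, and "other steps" is taken as the total minus the data mining and dataset download times.

```python
def shares(total_ms, mining_ms, download_ms):
    """Return (data mining, dataset download, other steps) as % of the total."""
    other_ms = total_ms - mining_ms - download_ms
    return tuple(round(100 * part / total_ms, 2)
                 for part in (mining_ms, download_ms, other_ms))


# Times in ms reported for the LAG scenario on slides 36 and 37.
smallest = shares(total_ms=111798, mining_ms=107474, download_ms=218)
largest = shares(total_ms=1031209, mining_ms=1026584, download_ms=665)
```

For the 0.5 MB dataset this gives (96.13, 0.19, 3.67) and for the 5 MB dataset (99.55, 0.06, 0.38), matching the figures quoted above.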

Page 42: Weka4WS : Enabling Distributed Data Mining on Grids

Execution times percentage - WAG scenario

• In the WAG scenario the data mining step represents from 84.62% to 88.32% of the total execution time, the dataset download ranges from 11.22% to 11.20%, while the other steps range from 4.16% to 0.48%

[Figure: percentage of the total execution time (0 to 100) versus dataset size (MB, from 0.5 to 5.0 in steps of 0.5) in the WAG scenario, with series for other steps, dataset download, and data mining. Data labels: 88.32%, 88.25%, 88.32%, 88.31%, 88.19%, 88.10%, 87.56%, 87.29%, 86.40%, 84.62%]

Page 43: Weka4WS : Enabling Distributed Data Mining on Grids

Performance considerations

• The performance analysis demonstrates the efficiency of the WSRF mechanisms and Weka4WS as a means to execute data mining tasks on remote machines
• In the LAG scenario neither the dataset download nor the other steps represent a significant overhead with respect to the total execution time
• In the WAG scenario, by contrast, the dataset download is a critical step that can affect the overall execution time:
  • for this reason, the use of data replication mechanisms and high-performance file transfer protocols such as GridFTP can be of great importance


Page 45: Weka4WS : Enabling Distributed Data Mining on Grids

Conclusions (1)

• Gathering data in a central site raises several privacy considerations that generally prevent centralized mining of different multi-owner data sources
• Privacy-preserving mining when data is distributed among many sites can be implemented through the cooperation of classifiers that learn global data mining results without revealing the data sources available at the single sites
• Flexible middleware and services running in distributed environments can help users to follow this approach

Page 46: Weka4WS : Enabling Distributed Data Mining on Grids

Conclusions (2)

• Weka4WS can be used to implement distributed data mining applications
• By exploiting such mechanisms, Weka4WS can provide an effective way to perform privacy-preserving distributed data analysis on large-scale Grids
• The experimental results demonstrate the efficiency of the WSRF mechanisms as a means to execute data mining tasks on remote resources
• A Weka4WS software prototype running under Globus Toolkit 4.0.1 will soon be made available to the research community

Page 47: Weka4WS : Enabling Distributed Data Mining on Grids

Thanks!