
Page 1:

Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments

Haimonti Dutta¹ and Hillol Kargupta²

¹ Center for Computational Learning Systems (CCLS), Columbia University, NY, USA.
² University of Maryland, Baltimore County, Baltimore, MD. Also affiliated with Agnik, LLC, Columbia, MD.

Page 2:

Motivation

Support Vector (Kernel) Regression: an illustration

Find a function f(x) = y that fits a set of example data points.

The problem can be phrased as a constrained optimization task.

It is solved using a standard LP solver.

Page 3:

Motivation contd.: Knowledge-Based Kernel Regression

In addition to sample points, give advice, e.g.: If (x ≥ 3) and (x ≤ 5) then (y ≥ 5).

Rules add constraints about regions; these constraints are added to the LP, and a new solution (with advice constraints) can be constructed. (A simplified sketch follows the references below.)

Fung, Mangasarian and Shavlik, “Knowledge-Based Support Vector Machine Classifiers”, NIPS, 2002.

Mangasarian, Shavlik and Wild, “Knowledge-Based Kernel Approximation”, JMLR, 5, 1127–1141, 2005.

Figure adapted from Maclin, Shavlik, Walker and Torrey, “Knowledge-Based Support Vector Regression for Reinforcement Learning”, IJCAI, 2005.
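To make the last two slides concrete, here is a minimal, much-simplified sketch (Python with SciPy, my own choices, not the formulation of the papers cited above): fit a linear model f(x) = w*x + b by minimizing the sum of absolute residuals, which is an LP, then add the advice rule "if 3 ≤ x ≤ 5 then y ≥ 5" as extra linear constraints. For a linear model the rule holds on the whole interval exactly when it holds at the endpoints; the knowledge-based kernel formulations in the cited papers are considerably more general. All data values are made up.

```python
# A much-simplified stand-in for the knowledge-based LP regression referenced above
# (NOT the formulation of the cited papers). Data values and the use of scipy's
# linprog are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up sample points
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
n = len(x)

# Decision variables: [w, b, t_1, ..., t_n], with t_i >= |y_i - f(x_i)|.
c = np.r_[0.0, 0.0, np.ones(n)]           # minimize the total slack
A_ub, b_ub = [], []
for i in range(n):
    e = np.zeros(n); e[i] = -1.0
    A_ub.append(np.r_[-x[i], -1.0, e]); b_ub.append(-y[i])   # y_i - f(x_i) <= t_i
    A_ub.append(np.r_[ x[i],  1.0, e]); b_ub.append( y[i])   # f(x_i) - y_i <= t_i

# Advice constraints f(3) >= 5 and f(5) >= 5, written as <= rows for the solver.
for xk in (3.0, 5.0):
    A_ub.append(np.r_[-xk, -1.0, np.zeros(n)]); b_ub.append(-5.0)

bounds = [(None, None), (None, None)] + [(0, None)] * n       # w, b free; t_i >= 0
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds, method="highs")
print(f"fit with advice: f(x) = {res.x[0]:.2f}*x + {res.x[1]:.2f}")
```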

Page 4:

Distributed Data Mining Applications – An example of Scientific Data Mining in Astronomy

Distributed data and computing resources on the National Virtual Observatory

P2P data mining on a homogeneously partitioned sky survey

H. Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, Ph.D. Thesis, UMBC, Maryland, 2007.

Need for distributed optimization strategies

Page 5:

Road Map

Motivation
Related Work
Framing a Linear Programming Problem
The Simplex Algorithm
The Distributed Simplex Algorithm
Experimental Results
Conclusions and Directions of Future Work

Page 6:

Related Work

Resource Discovery in Distributed Environments:
Iamnitchi, “Resource Discovery in Large Resource Sharing Environments”, Ph.D. Thesis, University of Chicago, 2003.
Livny and Solomon, “Matchmaking: Distributed Resource Management for High Throughput Computing”, HPDC, 1998.

Optimization Techniques:
Yarmish, “Distributed Implementation of the Simplex Method”, Ph.D. Thesis, CIS, Polytechnic University, 2001.
Hall and McKinnon, “Update Procedures for Parallel Revised Simplex Methods”, Tech. Report, University of Edinburgh, UK, 1992.
Craig and Reed, “Hypercube Implementation of the Simplex Algorithm”, ACM, pages 1473–1482, 1998.

Page 7:

The Optimization Problem


Assumptions:
n nodes in the network
The network is static
Dataset Di at node i
Processing cost at the i-th node: νi per record
Transportation cost between nodes i and j: μij
Amount of data transferred between nodes i and j: xij

Cost function: Z = Σij (μij + νi) xij = Σij cij xij

Page 8:

Framing the Linear Programming Problem: An illustration

Objective function: z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45

Cost: C(X) = Σij (μij + νj) xij = Σij cij xij, where cij = μij + νj

Constraints:
x12 + x14 + x15 ≤ 300; x12 + x25 + x23 ≤ 600; x15 + x25 + x45 ≤ 300; x23 + x34 ≤ 300
0 ≤ x12 ≤ D1; 0 ≤ x23 ≤ D2; 0 ≤ x15 ≤ D1; 0 ≤ x14 ≤ D1; 0 ≤ x25 ≤ D2; 0 ≤ x34 ≤ D3; 0 ≤ x45 ≤ D4

[Figure: a five-node network. Each node holds a local dataset (300 GB at nodes 1, 3, 4 and 5; 600 GB at node 2). The edges carry per-record transportation costs μ12 = 3.8, μ14 = 6.5, μ15 = 2.5, μ23 = 6.1, μ25 = 10.4, μ34 = 7.8, μ45 = 8.3, which combine with the per-record processing costs ν below to give the objective coefficients cij = μij + νj.]

Node   ν
1      1.23
2      2.23
3      2.94
4      1.78
5      4.02
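To make the framing concrete, a small sketch follows (Python, my own choice of language; the edge-to-cost mapping is reconstructed from the figure and the objective above): it assembles cij = μij + νj from the transportation and processing costs, reproducing the objective coefficients, and evaluates the cost of a made-up transfer plan.

```python
# A small sketch assembling the objective of the LP on this slide:
# c_ij = mu_ij + nu_j (transfer cost plus processing cost at the receiving node).
mu = {(1, 2): 3.8, (1, 4): 6.5, (1, 5): 2.5, (2, 3): 6.1,
      (2, 5): 10.4, (3, 4): 7.8, (4, 5): 8.3}            # per-record transportation costs
nu = {1: 1.23, 2: 2.23, 3: 2.94, 4: 1.78, 5: 4.02}       # per-record processing costs

c = {(i, j): round(mu[i, j] + nu[j], 2) for (i, j) in mu}
print(c)   # {(1, 2): 6.03, (1, 4): 8.28, (1, 5): 6.52, (2, 3): 9.04, ...}

# Cost Z of a candidate (made-up, not optimal) transfer plan x, in records per edge:
x = {(1, 2): 100, (1, 4): 0, (1, 5): 50, (2, 3): 0, (2, 5): 0, (3, 4): 0, (4, 5): 0}
Z = sum(c[e] * x[e] for e in x)
print(round(Z, 2))   # 929.0 = 6.03*100 + 6.52*50
```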

Page 9:

The Simplex Algorithm

Find x1 ≥ 0, x2 ≥ 0, …, xn ≥ 0 and Min z = c1 x1 + c2 x2 + … + cn xn

satisfying the constraints A1 x1 + A2 x2 + … + An xn = B

The simplex tableau:

a11   a12   …   a1n  |  b1
a21   a22   …   a2n  |  b2
…     …     …   …    |  …
am1   am2   …   amn  |  bm
c1    c2    …   cn   |  z

Page 10:

The Simplex Algorithm – Contd …


The problem: Maximize z = x1 + 2x2 - x3

Subject to: 2x1 + x2 + x3 ≤ 14; 4x1 + 2x2 + 3x3 ≤ 28; 2x1 + 5x2 + 5x3 ≤ 30

The steps of the simplex algorithm (Dantzig):
Obtain a canonical representation (introduce slack variables)
Find a column pivot
Find a row pivot
Perform Gauss-Jordan elimination

Page 11:

The simplex tableau and iterations

Canonical representation:

x1   x2   x3   s1   s2   s3  |  B
 2    1    1    1    0    0  | 14
 4    2    3    0    1    0  | 28
 2    5    5    0    0    1  | 30
-1   -2    1    0    0    0  |  0

Pivot column: the x2 column (the most negative entry, -2, in the objective row).

Row ratios: 14/1 = 14, 28/2 = 14, 30/5 = 6, so the third row is the pivot row.

Page 12:

Simplex iterations contd.: perform Gauss-Jordan elimination.

Tableau after the first pivot:

x1      x2   x3     s1   s2     s3    |  B
 8/5     0    0      1    0    -1/5   |  8
16/5     0    1      0    1    -2/5   | 16
 2/5     1    1      0    0     1/5   |  6
-1/5     0    3      0    0     2/5   | 12

The final tableau:

x1      x2   x3     s1   s2     s3    |  B
 0       0   -1/2    1   -1/2    0    |  0
 1       0    5/16   0    5/16  -1/8  |  5
 0       1    7/8    0   -1/8    1/4  |  4
 0       0   49/16   0    1/16   3/8  | 13
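The iterations above can be reproduced with a few lines of code. Below is a minimal tableau-simplex sketch (Python/NumPy, my own code, not the deck's): Dantzig's most-negative-entry column rule, the minimum-ratio row test, and Gauss-Jordan pivots, with no anti-cycling rule, applied to the example problem. It terminates with the optimal value 13 shown in the final tableau.

```python
# A minimal tableau-simplex sketch for small, non-degenerate examples.
import numpy as np

def simplex_max(c, A, b):
    """Maximize c @ x subject to A @ x <= b, x >= 0 (b assumed nonnegative)."""
    m, n = A.shape
    # Canonical representation: append one slack variable per constraint.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n], T[:m, n:n + m], T[:m, -1] = A, np.eye(m), b
    T[-1, :n] = -c                                    # objective row of the tableau
    while True:
        col = int(np.argmin(T[-1, :-1]))              # column pivot: most negative entry
        if T[-1, col] >= 0:
            return T[-1, -1], T                       # optimal value and final tableau
        ratios = np.full(m, np.inf)                   # row pivot: minimum ratio test
        pos = T[:m, col] > 1e-12
        ratios[pos] = T[:m, -1][pos] / T[:m, col][pos]
        row = int(np.argmin(ratios))
        if not np.isfinite(ratios[row]):
            raise ValueError("LP is unbounded")
        T[row] /= T[row, col]                         # Gauss-Jordan elimination on the pivot
        for r in range(m + 1):
            if r != row:
                T[r] -= T[r, col] * T[row]

# The example from the deck: maximize x1 + 2*x2 - x3 under the three constraints above.
c = np.array([1.0, 2.0, -1.0])
A = np.array([[2.0, 1.0, 1.0], [4.0, 2.0, 3.0], [2.0, 5.0, 5.0]])
b = np.array([14.0, 28.0, 30.0])
z_opt, _ = simplex_max(c, A, b)
print(z_opt)   # 13.0, the bottom-right entry of the final tableau
```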

Page 13:

Road Map

Motivation
Related Work
Framing a Linear Programming Problem
The Simplex Algorithm
The Distributed Simplex Algorithm
Experimental Results
Conclusions and Future Work

Page 14:

The Distributed Problem – An Example


[Figure: the five-node network from the earlier illustration. The constraints observed at the five sites are:]

x12 + x15 + x14 + 2x25 ≤ 300;  x12 + 2x15 - x25 = 2   (site holding 300 GB)
x12 + x23 + x25 ≤ 600;  2x25 - x12 - x23 = 4          (site holding 600 GB)
x15 + x25 + x45 ≤ 300;  x25 - 2x15 - x45 = 5          (site holding 300 GB)
x34 + 8x25 ≤ 300                                       (site holding 300 GB)
x23 + x34 ≤ 300                                        (site holding 300 GB)

Each site observes different constraints, but wants to solve the same objective function

z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45
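For reference, the LP above can also be solved centrally; the distributed simplex described next should reach the same optimum. Below is a sketch using SciPy's linprog (my choice of solver, not the deck's), with the variable order [x12, x23, x15, x14, x25, x34, x45].

```python
# A centralized reference solve of the LP on this slide (x >= 0 by linprog's default bounds).
import numpy as np
from scipy.optimize import linprog

c = np.array([6.03, 9.04, 6.52, 8.28, 14.42, 9.58, 12.32])   # shared objective

A_ub = np.array([            # inequality constraints, A_ub @ x <= b_ub
    [1, 0, 1, 1, 2, 0, 0],   # x12 + x15 + x14 + 2*x25 <= 300
    [1, 1, 0, 0, 1, 0, 0],   # x12 + x23 + x25         <= 600
    [0, 0, 1, 0, 1, 0, 1],   # x15 + x25 + x45         <= 300
    [0, 0, 0, 0, 8, 1, 0],   # 8*x25 + x34             <= 300
    [0, 1, 0, 0, 0, 1, 0],   # x23 + x34               <= 300
])
b_ub = np.array([300, 600, 300, 300, 300])

A_eq = np.array([            # equality constraints, A_eq @ x == b_eq
    [1, 0, 2, 0, -1, 0, 0],  # x12 + 2*x15 - x25 == 2
    [-1, -1, 0, 0, 2, 0, 0], # 2*x25 - x12 - x23 == 4
    [0, 0, -2, 0, 1, 0, -1], # x25 - 2*x15 - x45 == 5
])
b_eq = np.array([2, 4, 5])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
print(res.fun)   # minimum transfer-plus-processing cost
print(res.x)     # the optimal amounts of data to move
```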

Page 15:

Distributed Canonical Representation


An initialization step:
Number of basic variables to add = total number of constraints in the system
Build a spanning tree in the network
Perform a distributed sum-estimation algorithm
This builds a canonical representation exactly identical to the one that would be obtained if the data were centralized
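A minimal sketch of this initialization step follows (Python; the rooted spanning tree and the per-node constraint counts are illustrative assumptions, chosen to total the running example's eight constraints): each node sums its local constraint count with its children's subtree sums and reports the result to its parent; the root then broadcasts the total so every node appends the same number of slack columns.

```python
# Distributed sum over a spanning tree (convergecast), sketched as a local recursion.
tree_children = {1: [2, 5], 2: [3, 4], 3: [], 4: [], 5: []}   # hypothetical tree, rooted at node 1
local_constraints = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2}            # constraints held at each node

def subtree_sum(node):
    """Sum the constraint counts in this node's subtree and report it to the parent."""
    return local_constraints[node] + sum(subtree_sum(child) for child in tree_children[node])

total = subtree_sum(1)    # the root learns the global count ...
print(total)              # 8 -> eight slack/basic variables (s1..s8), as in the tableaux that follow
```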

Page 16:

The Distributed Algorithm for solving the LP problem


Steps involved:
Estimate the column pivot
Estimate the row pivot (requires communication with neighbors)
Perform Gauss-Jordan elimination

Page 17:

Illustration of the Distributed Algorithm

Local tableau at one node (its two constraints plus the shared objective row):

x12    x23    x15    x14    x25     x34    x45     s1 s2 s3 s4 s5 s6 s7 s8 |   B
 1      0      1      1      2       0      0       1  0  0  0  0  0  0  0 | 300
 1      0      2      0     -1       0      0       0  1  0  0  0  0  0  0 |   2
-6.03  -9.04  -6.52  -8.28  -14.42  -9.58  -12.32   0  0  0  0  0  0  0  0 |   0

Local tableau at another node (one constraint plus the shared objective row):

x12    x23    x15    x14    x25     x34    x45     s1 s2 s3 s4 s5 s6 s7 s8 |   B
 0      0      0      0      8       1      1       0  0  0  0  0  0  1  0 | 300
-6.03  -9.04  -6.52  -8.28  -14.42  -9.58  -12.32   0  0  0  0  0  0  0  0 |   0

Column pivot selection is done at each node

Page 18:

Distributed Row Pivot selection


Protocol Push-Min (gossip based): a minimum-estimation problem.
At iteration t - 1, values {mr} are sent to node i.
mi(t) = min({mr}, node i's current row-pivot candidate)
Termination: all nodes hold exactly the same minimum value.
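A minimal sketch of the Push-Min idea follows (Python; the ring topology, the synchronous rounds and the local row-pivot ratios are illustrative assumptions, not the paper's exact protocol): in each round every node pushes its current minimum to one random neighbor and keeps the smallest value it has seen, so the minimum spreads through the network like an epidemic.

```python
# Gossip-style minimum estimation over a small hypothetical network.
import random

neighbors = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}   # hypothetical ring
local_ratio = {1: 14.0, 2: 14.0, 3: 6.0, 4: 9.5, 5: 12.0}             # local min-ratio candidates

estimate = dict(local_ratio)                  # every node starts from its own candidate
for _ in range(10):                           # a few synchronous gossip rounds
    inbox = {v: [] for v in neighbors}
    for v in neighbors:                       # each node pushes its current minimum ...
        target = random.choice(neighbors[v])
        inbox[target].append(estimate[v])
    for v in neighbors:                       # ... and keeps the smallest value it has seen
        estimate[v] = min([estimate[v]] + inbox[v])

print(estimate)   # with enough rounds, every node holds the global minimum (6.0) with high probability
```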

Page 19:

Analysis of Protocol Push Min


Based on the spread of an epidemic in a large population
Susceptible, infected and dead nodes
The “epidemic” spreads exponentially fast

[Figure: the five-node example network.]

Page 20:

Comments and Discussions


Assume η nodes in the network.
Communication complexity is O(number of simplex iterations × η).
In the worst case, simplex may require an exponential number of iterations.
For most practical purposes the number of iterations is λm (λ < 4), where m is the number of constraints.

Page 21:

Road Map

Motivation
Related Work
Framing a Linear Programming Problem
The Simplex Algorithm
The Distributed Simplex Algorithm
Experimental Results
Conclusions and Directions of Future Work

Page 22:

Experimental Results

Artificial data set: simulated constraint matrices at each node.
Used the Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County (UMBC), to simulate the network structure.

Two metrics for evaluation:
TCC: Total Communication Cost in the network
ACCN: Average Communication Cost per Node

Page 23:

Communication Cost

[Figure: Average Communication Cost per Node (ACCN) versus the number of nodes in the network.]

Page 24:

More Experimental Results

[Figures: TCC versus the number of variables at each node; TCC versus the number of constraints at each node.]

Page 25:

Conclusions and Future Work

Resource management and pattern recognition present formidable challenges on distributed systems.

We present a distributed algorithm for resource management based on the simplex algorithm and test it on simulated data.

Future work:
Incorporate the dynamics of the network
Test the algorithm on a real distributed network
Study the effect of the size and structure of the network on the mining results
Examine the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering

Page 26:

Selected Bibliography

G. B. Dantzig, “Linear Programming and Extensions”, Princeton University Press, Princeton, NJ, 1963.

Kargupta and Chan, “Advances in Distributed and Parallel Knowledge Discovery”, AAAI Press, Menlo Park, CA, 2000.

A. L. Turinsky, “Balancing Cost and Accuracy in Distributed Data Mining”, Ph.D. Thesis, University of Illinois at Chicago, 2002.

Haimonti Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, Ph.D. Thesis, UMBC, 2007.

Mangasarian, “Mathematical Programming in Data Mining”, DMKD, Vol. 42, pp. 183–201, 1997.

Page 27:

Questions?