A Portal-based P2P System for the Distribution and Management of Large Data Sets


Page 1: A Portal-based P2P System for the Distribution and Management of Large Data Sets

May, 07 [email protected]

A Portal-based P2P System for the Distribution and Management of Large Data Sets

Rahim Lakhoo (Raz) and Prof Mark Baker

ACET, University of Reading E-mail: [email protected] Web: http://acet.rdg.ac.uk/~rnl

Outline

• Motivation.
• A Portal-based P2P System:
  – High-level View,
  – Overview,
  – Components.
• P2P Simulators:
  – Our requirements,
  – Simulators investigated,
  – Issues,
  – Experiences.
• Summary.
• Conclusions.

Motivation

• Sloan Digital Sky Survey (SDSS) - uses a telescope to take optical images of the sky.
• Scientific projects such as SDSS are producing and working with very large data sets.
• Current methods for distributing the content involve:
  – Physically shipping disk drives,
  – Splitting the data and transferring it point-to-point from one location to another.
• Data sets are growing for projects like SDSS:
  – Currently 5 Tbytes,
  – Set to be ~15 Tbytes by the end of the project.
• Storage and bandwidth are costly and limited, and the data sets will inevitably get larger.
• Managing and maintaining these large data sets is difficult, and will only become harder over time.

Motivation

• P2P is being used by ordinary people to download multimedia.
• A popular example is BitTorrent.
• Its success stems from its protocol, which makes users share their bandwidth with other people trying to download the same file.
• BitTorrent Concepts:
  – Files are split into small pieces called ‘chunks’,
  – Chunks are seeded (uploaded) by a user,
  – Users download a ‘torrent’ file, which has information about a file,
  – A user loads the ‘torrent’ into an application, which then downloads chunks from different peers,
  – A ‘tracker’ tracks which peers have which chunks.
• Peer-to-Peer (P2P) systems offer a potential way to manage and distribute data sets.
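The chunking idea above can be illustrated with a minimal Python sketch. It is not BitTorrent code: the function names and the tiny chunk size are assumptions chosen for readability. It shows only that a file split into fixed-size chunks can be rebuilt even when the chunks arrive out of order from different peers.

```python
from typing import Dict, List

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks_by_index: Dict[int, bytes]) -> bytes:
    """Join chunks in index order, whatever order they arrived in."""
    return b"".join(chunks_by_index[i] for i in sorted(chunks_by_index))


data = b"sky survey image bytes"
chunks = split_into_chunks(data)

# Pretend the chunks arrive in reverse order from different peers.
received = {index: chunk for index, chunk in reversed(list(enumerate(chunks)))}
restored = reassemble(received)
```

Because each chunk carries its index, the downloader does not care which peer supplied it or when.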

High-level View

• Data sets such as SDSS are currently kept in a storage mechanism, such as a RAID array.
• A bootstrapping service is set up and has access to the SDSS data.
• The data is split into chunks and distributed to the Portal P2P services, hosted by different portals.
• Users who access the portal can contribute resources to help store and distribute the data. These are the Mini Peers.
• The Portal P2P services propagate parts of the data set to the Mini Peers.
• Any other project partners who want a copy of the data can join the P2P network and download parts of the data set from Portal and Mini Peers.

Overview

• Ideas are loosely based around the concepts of BitTorrent and Freenet.
• The P2P System consists of:
  – A distributed registry, which stores information for the network peers and also provides a tracker,
  – A Bootstrapping Service, which splits the data set into chunks to be distributed by the peers,
  – A Portal P2P Service, which provides storage and management of the data:
    • This service also propagates chunks to the Mini Peers.
  – Mini Peers, which donate bandwidth and disk space to the network.

[Figure: Overview]

Overview

• The registry (VR) provides the distributed tracker:
  – A tracker helps peers locate other peers with chunks to download.
• The Bootstrapper initiates the propagation of the data set to the peers.
• The Portal P2P service manages the Mini Peers.
• The portal has management and monitoring tools for the data set.
• All peers volunteer resources to the P2P network.

The Virtual Registry

• The Virtual Registry (VR) is provided by Tycho.
• Tycho is a wide-area asynchronous message passing system with an integrated distributed registry.
• The VR can store information which can be searched and retrieved by peers on the network.
• Tycho uses HTTP/HTTPS and Sockets/SSL for communications.
• The VR will provide the distributed P2P tracker service, for finding peers with chunks to download.
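As a rough illustration of the tracker role described above, the sketch below keeps a mapping from chunk identifiers to the peers that hold them. It is a single-process stand-in, not Tycho's API: the class `InMemoryRegistry`, its method names, and the peer and chunk identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class InMemoryRegistry:
    """Hypothetical stand-in for the Virtual Registry; not Tycho's real API."""

    def __init__(self) -> None:
        # chunk id -> ids of the peers that have published that chunk
        self._holders: Dict[str, Set[str]] = defaultdict(set)

    def publish(self, peer_id: str, chunk_id: str) -> None:
        """Record that a peer holds a chunk."""
        self._holders[chunk_id].add(peer_id)

    def withdraw(self, peer_id: str) -> None:
        """Forget a peer, for example when it leaves the network."""
        for holders in self._holders.values():
            holders.discard(peer_id)

    def find_peers(self, chunk_id: str) -> Set[str]:
        """Return a copy of the set of peers known to hold a chunk."""
        return set(self._holders.get(chunk_id, set()))


registry = InMemoryRegistry()
registry.publish("portal-A", "chunk-0")
registry.publish("mini-peer-7", "chunk-0")
registry.publish("portal-A", "chunk-1")
registry.withdraw("mini-peer-7")
```

A real distributed registry would also have to replicate these records across mediators and cope with stale entries; none of that is modelled here.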

[Figure: The Virtual Registry]

The Virtual Registry

• Tycho has a Service Oriented Architecture that uses the concept of producers and consumers.
• In our system, each Tycho mediator has a consumer and a producer for communications.
• Mediators provide the VR with a distributed data store, which uses HSQLDB as its database.
• Local communications are via Sockets/SSL and wide-area communications via HTTP/HTTPS.

The Bootstrapper

• A bootstrapping service is needed to propagate parts of the data set to the Portal P2P service.

• This service splits the data set into chunks.

• Each chunk has an associated hash value, which is stored in the Virtual Registry.

• The bootstrapping service needs access to the original data set(s).
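The chunk-and-hash step can be sketched as follows. The slides do not say which hash function is used, so SHA-256 is an assumption here, as are the record layout and the function names `build_manifest` and `is_intact`. The resulting list of records is the kind of information that could be stored in the Virtual Registry.

```python
import hashlib
from typing import Dict, List


def build_manifest(data: bytes, chunk_size: int) -> List[Dict[str, object]]:
    """Return one record per chunk: index, size and SHA-256 hex digest."""
    manifest = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        manifest.append({
            "index": index,
            "size": len(chunk),
            "sha256": hashlib.sha256(chunk).hexdigest(),  # assumed hash choice
        })
    return manifest


def is_intact(chunk: bytes, record: Dict[str, object]) -> bool:
    """Check a received chunk against its manifest record."""
    return hashlib.sha256(chunk).hexdigest() == record["sha256"]


data_set = bytes(range(256)) * 4  # 1024 bytes standing in for the data set
manifest = build_manifest(data_set, chunk_size=300)
```

Any peer holding the manifest can verify a chunk on arrival without contacting the bootstrapper.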

The Bootstrapper

[Figure: diagram of the Bootstrapper. Labelled components: BootStrapper, Original Data Set, P2P Distribution Service, Tycho, HSQLDB, Stored Hash Values, Virtual Registry, Portal (twice). Labelled links: Socket/SSL and HTTP/HTTPS.]

The Bootstrapper

• The bootstrapping service needs to propagate different chunks to different Portals concurrently.

• Hash values and metadata about the data set and chunks are stored in the VR.

• This service is also used if a requested chunk is not found on the P2P network, due to chunk corruption. In this case, the missing chunk needs to be replaced in the P2P system.
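The two behaviours above, concurrent propagation and chunk replacement, can be sketched together. Everything here is illustrative: the portals are in-memory dictionaries, the round-robin placement policy is an assumption (the slides do not specify one), and SHA-256 is again an assumed hash.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical in-memory portals: portal name -> {chunk index -> chunk bytes}.
portals: Dict[str, Dict[int, bytes]] = {"portal-A": {}, "portal-B": {}, "portal-C": {}}


def push_chunk(portal: str, index: int, chunk: bytes) -> None:
    """Stand-in for a network transfer to one portal."""
    portals[portal][index] = chunk


def propagate(chunks: List[bytes]) -> Dict[int, str]:
    """Send chunk i to portal (i mod N), running the transfers concurrently."""
    names = sorted(portals)
    placement = {i: names[i % len(names)] for i in range(len(chunks))}
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = [pool.submit(push_chunk, placement[i], i, chunk)
                   for i, chunk in enumerate(chunks)]
        for future in futures:
            future.result()  # re-raise any transfer error
    return placement


def repair(portal: str, index: int, expected_hash: str,
           original_chunks: List[bytes]) -> bool:
    """Replace a missing or corrupt chunk from the original data set.

    Returns True if the chunk had to be replaced.
    """
    stored = portals[portal].get(index)
    if stored is not None and hashlib.sha256(stored).hexdigest() == expected_hash:
        return False
    portals[portal][index] = original_chunks[index]
    return True


chunks = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
placement = propagate(chunks)

portals["portal-B"][1] = b"b?bb"  # simulate corruption of chunk 1
was_repaired = repair("portal-B", 1, hashes[1], chunks)
```

The repair path is why the bootstrapper needs continued access to the original data set: it is the source of last resort when no intact copy of a chunk exists in the network.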

The Portal P2P Service

• The Portal P2P service is a plug-in component for portals.

• This service stores and serves chunks of the data set to other peers in the network.

• The portal service propagates chunks to the Mini Peers.

• The monitoring and management of the data set is handled by the portlet tools and the P2P service.

• The portal service uses Tycho to synchronise management tools across all portals in the network.

The Portal P2P Service

[Figure: diagram of the Portal P2P Service. The following set of labels appears twice: Hosting Partner, Portal with management tools, P2P Service, Tycho, HSQLDB, Storage (Chunks), Collection of Mini Peers. Other labels: Virtual Registry, P2P Swarm. Labelled links: Socket/SSL and HTTP/HTTPS.]

The Portal P2P Service

• Each Portal P2P service needs access to a storage mechanism, for parts of the data set.

• The storage resources provided by the portals provide space for a copy of the large data set.

• The Portal P2P service also provides parts of the data set to other peers in the P2P network.

• The Portal provides users with an environment for managing and monitoring the data set collaboratively between peers.

The Mini Peers

• Mini peers donate bandwidth and storage space to the network.

• Mini peers will interact with the P2P network via their Web browser.

• Mini peers will store chunks that are useful for other peers.

• Mini peers aim to help other peers download and distribute the data set.

The Mini Peers

[Figure: diagram of the Mini Peers. Labels: Portal (twice), P2P Swarm, Mini Peer (six times), and a further Mini Peer labelled with Web Browser, P2P Service, Storage (chunks) and Tycho. Labelled links: Socket/SSL.]

The Mini Peers

• Client-side Web browser technologies such as Ajax and JavaScript will be used for the Mini Peer.

• They will utilise the VR to publish parts of the data set, to share with other peers in the network.

• Mini Peers will store chunks locally on a user's machine.
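A Mini Peer's local store might behave roughly like the sketch below. The slides plan a browser-side implementation in JavaScript, so this Python version is only a language-neutral illustration. The class `MiniPeer`, the byte quota, and the method names are assumptions rather than part of the described design.

```python
from typing import Dict, Optional, Set


class MiniPeer:
    """Hypothetical Mini Peer that donates a fixed amount of storage."""

    def __init__(self, peer_id: str, quota_bytes: int) -> None:
        self.peer_id = peer_id
        self.quota_bytes = quota_bytes
        self._chunks: Dict[str, bytes] = {}

    def used_bytes(self) -> int:
        """Total size of the chunks currently held."""
        return sum(len(chunk) for chunk in self._chunks.values())

    def store(self, chunk_id: str, chunk: bytes) -> bool:
        """Keep the chunk if it fits in the donated space; report success."""
        replaced = len(self._chunks.get(chunk_id, b""))
        if self.used_bytes() - replaced + len(chunk) > self.quota_bytes:
            return False
        self._chunks[chunk_id] = chunk
        return True

    def published(self) -> Set[str]:
        """Chunk ids this peer would advertise in the registry."""
        return set(self._chunks)

    def serve(self, chunk_id: str) -> Optional[bytes]:
        """Return a stored chunk to another peer, or None if not held."""
        return self._chunks.get(chunk_id)


peer = MiniPeer("mini-1", quota_bytes=10)
first = peer.store("c0", b"12345")      # fits: 5 of 10 bytes used
too_big = peer.store("c1", b"123456")   # would need 11 bytes, so refused
second = peer.store("c1", b"12345")     # fits exactly: 10 of 10 bytes used
```

The quota check reflects that a Mini Peer contributes only what its user chooses to donate, so the portal service cannot assume a chunk was accepted.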

P2P Simulators - Requirements

• We wanted to use a simulator to help test and develop our P2P system with greater assurance.
• Running the P2P system in a simulator would allow us to configure scenarios for studying system behaviour.
• Our requirements for a simulator were:
  – Have support for customised P2P protocols,
  – Provide facilities for hierarchical topologies,
  – Provide visualisations,
  – Provide reasonably accurate results in terms of ‘real-world’ performance,
  – Have good support and documentation,
  – Be capable of interfacing with Java.

P2P Simulators

• There are many network simulators; some are more suited to P2P than others.
• Simulators investigated include:
  – NS-2 with NAM,
  – PeerSim,
  – PlanetSim,
  – OMNeT++ and OverSim,
  – General Purpose Simulator (GPS),
  – AgentJ,
  – P2PSim.

Issues

• We shortlisted three simulators:
  – General Purpose Simulator (GPS),
  – AgentJ,
  – OverSim.
• GPS
  – Difficult to implement our own protocol, as the simulator is tightly coupled to the BitTorrent protocol,
  – Stability issues were seen with larger simulations.
• AgentJ
  – Requires a normal Java application,
  – Does not support TCP in the simulation environment.
• OverSim
  – Java support is limited and restrictive. It is not possible to implement a whole simulation with the provided Java support.

Experiences

• No simulator completely fulfilled our requirements.
• We could not successfully implement our Portal-based P2P system in these simulators.
• Some of the simulators are complex and take extensive time to learn.
• Stability issues were seen with some of the simulators.
• Code written for a simulation is specific to a particular simulator. The code cannot be reused in the later stages of development.
• The advantages of a simulator do not justify the time taken to implement our P2P system in one.

Summary

• We are developing a Portal-based P2P system to help the scientific community to manage, store and distribute large data sets.
• Our Portal-based P2P system introduces the concept of data sets being collaboratively downloaded and managed.
• The Portal-based P2P system has four main components:
  – Virtual Registry,
  – Bootstrapping service,
  – Portal P2P service,
  – Mini Peers.
• We attempted to simulate our design and idea with one of the P2P simulators.
• We have investigated and tested several P2P simulators for their suitability to emulate our design.
• We found that the simulators we studied were inflexible, unstable, and not easy to use - basically we would have spent more time fixing them than actually implementing and testing our design on a cluster.

Conclusions

• Distributing and managing large data sets is difficult for projects such as SDSS.
• P2P simulators are not as useful as first thought.
• We will implement our Portal-based P2P system and test it on a suitable test bed, i.e. a cluster.
• Once the development of our P2P system has reached a suitable stage, we may consider systems such as PlanetLab.
  – PlanetLab provides time on a real network with hundreds of nodes, hosted by academic institutes.
• P2P systems are known to be an efficient way to distribute files and are becoming increasingly popular.
• Implementation should be at a suitable stage for preliminary testing in a few months.

References

Tycho - http://acet.rdg.ac.uk/projects/tycho

Further Information - http://acet.rdg.ac.uk/projects/vre/docs.php

Thank you for listening

Questions?