May, 07 [email protected]
A Portal-based P2P System for the Distribution and Management
of Large Data Sets
Rahim Lakhoo (Raz) and Prof Mark Baker
ACET, University of Reading E-mail: [email protected] Web: http://acet.rdg.ac.uk/~rnl
Outline
• Motivation.
• A Portal-based P2P System:
– High-level View,
– Overview,
– Components.
• P2P Simulators:
– Our requirements,
– Simulators investigated,
– Issues,
– Experiences.
• Summary.
• Conclusions.
Motivation
• The Sloan Digital Sky Survey (SDSS) uses a telescope to take optical images of the sky.
• Scientific projects such as SDSS are producing and working with very large data sets.
• Current methods for distributing the content involve:
– Physically shipping disk drives,
– Splitting the data and transferring it point-to-point from one location to another.
• Data sets are growing for projects like SDSS:
– Currently, 5 Tbytes,
– Set to be ~15 Tbytes by the end of the project.
• Storage and bandwidth are costly and limited, and the data sets will inevitably get larger.
• Managing and maintaining these large data sets is difficult, and will only become harder over time.
Motivation
• P2P is already widely used by ordinary users to download multimedia.
• A popular example is BitTorrent.
• Its success stems from its protocol, which makes users share their bandwidth with other people trying to download the same file.
• BitTorrent Concepts:
– Files are split into small pieces called ‘chunks’,
– Chunks are seeded (uploaded) by a user,
– Users download a ‘torrent’ file which has information about a file,
– A user loads the ‘torrent’ into an application which then downloads chunks from different peers,
– A ‘tracker’ tracks which peers have what chunks.
• Peer-to-Peer (P2P) systems offer a potential way to manage and distribute data sets.
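The BitTorrent concepts listed on this slide can be illustrated with a minimal sketch. It is not the BitTorrent protocol itself: the chunk size, the peer names, and the dictionary standing in for the tracker are all assumptions chosen purely for illustration.

```python
# Toy illustration of chunks, peers, and a tracker (not the real BitTorrent protocol).
CHUNK_SIZE = 4  # bytes; hypothetical, real systems use far larger chunks


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def download(tracker: dict[int, list[str]], peers: dict[str, dict[int, bytes]]) -> bytes:
    """Fetch every chunk from a peer the tracker lists for it, then reassemble."""
    pieces = []
    for index in sorted(tracker):
        holder = tracker[index][0]  # pick the first peer known to hold the chunk
        pieces.append(peers[holder][index])
    return b"".join(pieces)


original = b"sky survey image data"
chunks = split_into_chunks(original)

# Two hypothetical peers, each seeding only part of the file.
peers = {
    "peer-a": {i: c for i, c in enumerate(chunks) if i % 2 == 0},
    "peer-b": {i: c for i, c in enumerate(chunks) if i % 2 == 1},
}

# The tracker records which peers have which chunks.
tracker = {
    i: [name for name, held in peers.items() if i in held]
    for i in range(len(chunks))
}

rebuilt = download(tracker, peers)
```

Neither peer holds the whole file, yet the downloader can rebuild it because the tracker tells it where each chunk lives.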
High-level View
• Data sets such as SDSS are currently kept in a storage mechanism, such as a RAID array.
• A bootstrapping service is set up and has access to the SDSS data.
• The data is split into chunks and distributed to the Portal P2P services, hosted by different portals.
• Users who access the portal can contribute resources to help store and distribute the data. These are the Mini Peers.
• The Portal P2P services propagate parts of the data set to the Mini Peers.
• Any other project partners who want a copy of the data can join the P2P network and download parts of the data set from Portal and Mini Peers.
Overview
• Ideas are loosely based around the concepts of BitTorrent and Freenet.
• The P2P System consists of:
– A distributed registry, which stores information for the network peers and also provides a tracker,
– A Bootstrapping Service, which splits the data set into chunks to be distributed by the peers,
– A Portal P2P Service, which provides storage and management of the data; this service also propagates chunks to the Mini Peers,
– Mini Peers, which donate bandwidth and disk space to the network.
Overview
• The registry (VR) provides the distributed tracker:
– A tracker helps peers locate other peers with chunks to download.
• The Bootstrapper initiates the propagation of the data set to the peers.
• The Portal P2P service manages the Mini Peers.
• The portal has management and monitoring tools for the data set.
• All peers volunteer resources to the P2P network.
The Virtual Registry
• The Virtual Registry (VR) is provided by Tycho.
• Tycho is a wide-area asynchronous message passing system with an integrated distributed registry.
• The VR can store information which can be searched and retrieved by peers on the network.
• Tycho uses HTTP/HTTPS and Sockets/SSL for communications.
• The VR will provide the distributed P2P tracker service, for finding peers with chunks to download.
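As a rough picture of a registry acting as a tracker, the sketch below keeps an in-memory map from chunk identifiers to the peers that advertise them. The class and its `publish`, `withdraw` and `find_peers` methods are hypothetical and are not Tycho's API; the peer and chunk names are invented for the example.

```python
from collections import defaultdict


class RegistrySketch:
    """In-memory stand-in for a registry used as a tracker (hypothetical, not Tycho's API)."""

    def __init__(self) -> None:
        self._holders: dict[str, set[str]] = defaultdict(set)

    def publish(self, peer: str, chunk_id: str) -> None:
        """Record that `peer` can serve `chunk_id`."""
        self._holders[chunk_id].add(peer)

    def withdraw(self, peer: str) -> None:
        """Remove a departing peer from every chunk entry."""
        for holders in self._holders.values():
            holders.discard(peer)

    def find_peers(self, chunk_id: str) -> list[str]:
        """Return the peers currently advertising `chunk_id`, sorted for stable output."""
        return sorted(self._holders.get(chunk_id, set()))


registry = RegistrySketch()
registry.publish("portal-1", "chunk-0001")
registry.publish("mini-peer-7", "chunk-0001")
registry.publish("mini-peer-7", "chunk-0002")

# A Mini Peer leaves the network, so its entries are withdrawn.
registry.withdraw("mini-peer-7")
```

After the withdrawal, a lookup for `chunk-0001` returns only `portal-1`, and `chunk-0002` has no holders left, which is the situation in which a chunk would need to be re-supplied from the original data set.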
The Virtual Registry
• Tycho has a Service Oriented Architecture that uses the concept of producers and consumers.
• In our system, each Tycho mediator has a consumer and producer, for communications.
• Mediators provide the VR with a distributed data store, which uses HSQLDB as its database.
• Local communications are via Sockets/SSL and wide-area communications via HTTP/HTTPS.
The Bootstrapper
• A bootstrapping service is needed to propagate parts of the data set to the Portal P2P services.
• This service splits the data set into chunks.
• Each chunk has an associated hash value, which is stored in the Virtual Registry.
• The bootstrapping service needs access to the original data set(s).
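The chunk-and-hash step described on this slide might look roughly like the following. The slides do not name a hash algorithm or a chunk size, so SHA-256 and the 10-byte chunks are assumptions, and a plain dictionary stands in for the hash values held in the Virtual Registry.

```python
import hashlib
import io


def chunk_and_hash(stream, chunk_size: int) -> list[tuple[int, str, bytes]]:
    """Read `stream` in fixed-size chunks and return (index, hex digest, chunk) triples."""
    records = []
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest = hashlib.sha256(chunk).hexdigest()  # SHA-256 is an assumption
        records.append((index, digest, chunk))
        index += 1
    return records


# A small in-memory buffer stands in for the original data set.
data_set = io.BytesIO(b"A" * 10 + b"B" * 10 + b"C" * 5)
records = chunk_and_hash(data_set, chunk_size=10)

# A plain dictionary stands in for the hash values stored in the Virtual Registry.
registry_hashes = {index: digest for index, digest, _ in records}
```

Reading the stream piece by piece, rather than loading it whole, matters for data sets measured in Tbytes; the final chunk is simply shorter than the rest.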
The Bootstrapper
[Figure: The Bootstrapper. Components shown: BootStrapper, Original Data Set, P2P Distribution Service, Tycho, HSQLDB (stored hash values), Virtual Registry, and two Portals. Connections are labelled Socket/SSL and HTTP/HTTPS.]
The Bootstrapper
• The bootstrapping service needs to propagate different chunks to different Portals concurrently.
• Hash values and metadata about the data set and chunks are stored in the VR.
• This service is also used if a requested chunk cannot be found on the P2P network due to chunk corruption. In this case, the missing chunk needs to be replaced in the P2P system.
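One way to picture the repair path is to check each peer's copy of a chunk against the stored hash and fall back to the original data set when no valid copy exists. This is a sketch under assumptions: SHA-256, the `fetch_chunk` helper, and the peer names are illustrative and do not come from the slides.

```python
import hashlib


def sha256_hex(chunk: bytes) -> str:
    """Hex digest used to compare a chunk with its stored hash (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def fetch_chunk(index, expected_hash, peer_copies, original_chunks):
    """Return (chunk, source). Try peer copies first; fall back to the original data set."""
    for peer, copy in peer_copies.items():
        if copy is not None and sha256_hex(copy) == expected_hash:
            return copy, peer
    # No valid copy on the network: the bootstrapper re-reads the chunk from
    # the original data set so it can be replaced in the P2P system.
    return original_chunks[index], "bootstrapper"


original_chunks = [b"chunk-zero", b"chunk-one"]
stored_hashes = [sha256_hex(c) for c in original_chunks]  # as kept in the registry

# Hypothetical network state: chunk 0 is healthy on one peer,
# chunk 1 is corrupt on one peer and missing on the other.
copies_of_0 = {"portal-1": b"chunk-zero", "mini-peer-3": None}
copies_of_1 = {"portal-1": b"chunk-0ne", "mini-peer-3": None}

good, source_0 = fetch_chunk(0, stored_hashes[0], copies_of_0, original_chunks)
repaired, source_1 = fetch_chunk(1, stored_hashes[1], copies_of_1, original_chunks)
```

Chunk 0 is served by `portal-1`, while chunk 1 fails verification everywhere and is re-supplied by the bootstrapper, which is why that service needs continued access to the original data set.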
The Portal P2P Service
• The Portal P2P service is a plug-in component for portals.
• This service stores and serves chunks of the data set to other peers in the network.
• The portal service propagates chunks to the Mini peers.
• The monitoring and management of the data set is handled by the portlet tools and the P2P service.
• The portal service uses Tycho to synchronise management tools across all portals in the network.
The Portal P2P Service
[Figure: The Portal P2P Service. Components shown: two Hosting Partners, each with a Portal with management tools, a P2P Service, Tycho, HSQLDB and Storage (chunks); the Virtual Registry; and the P2P Swarm, made up of collections of Mini Peers. Connections are labelled Socket/SSL and HTTP/HTTPS.]
The Portal P2P Service
• Each Portal P2P service needs access to a storage mechanism, for parts of the data set.
• The storage resources provided by the portals provide space for a copy of the large data set.
• The Portal P2P service also provides parts of the data set to other peers in the P2P network.
• The Portal provides users with an environment for managing and monitoring the data set collaboratively between peers.
The Mini Peers
• Mini peers donate bandwidth and storage space to the network.
• Mini peers will interact with the P2P network via their Web browser.
• Mini peers will store chunks that are useful for other peers.
• Mini peers aim to help other peers download and distribute the data set.
The Mini Peers
[Figure: The Mini Peers. Components shown: two Portals and a P2P Swarm of Mini Peers, with connections labelled Socket/SSL; a detail of a single Mini Peer shows a Web Browser, P2P Service, Tycho and Storage (chunks).]
The Mini Peers
• Client-side Web browser technologies such as Ajax and JavaScript will be used for the Mini Peer.
• They will utilise the VR to publish parts of the data set, to share with other peers in the network.
• Mini Peers will store chunks locally on a user's machine.
P2P Simulators - Requirements
• We wanted to use a simulator to help test and develop our P2P system with greater assurance.
• Running the P2P system in a simulator would allow us to configure scenarios for studying system behaviour.
• Our requirements for a simulator were:
– Have support for customised P2P protocols,
– Provide facilities for hierarchical topologies,
– Provide visualisations,
– Provide reasonably accurate results in terms of ‘real-world’ performance,
– Have good support and documentation,
– Be capable of interfacing with Java.
P2P Simulators
• There are many network simulators; some are more suited to P2P than others.
• Simulators investigated include:
– NS-2 with NAM,
– PeerSim,
– PlanetSim,
– OMNeT++ and OverSim,
– General Purpose Simulator (GPS),
– AgentJ,
– P2PSim.
Issues
• We shortlisted three simulators:
– General Purpose Simulator (GPS),
– AgentJ,
– OverSim.
• GPS:
– Difficult to implement our own protocol, as the simulator is tightly coupled to the BitTorrent protocol,
– Stability issues were seen with larger simulations.
• AgentJ:
– Requires a normal Java application,
– Does not support TCP in the simulation environment.
• OverSim:
– Java support is limited and restrictive. It is not possible to implement a whole simulation with the provided Java support.
Experiences
• No simulator completely fulfilled our requirements.
• We could not successfully implement our Portal-based P2P system in these simulators.
• Some of the simulators are complex and take extensive time to learn.
• Stability issues were seen with some of the simulators.
• Code written for a simulation is specific to a particular simulator. The code cannot be reused in the later stages of development.
• The time needed to implement our P2P system in a simulator is not justified by the advantages it offers.
Summary
• We are developing a Portal-based P2P system to help the scientific community to manage, store and distribute large data sets.
• Our Portal-based P2P system introduces the concept of data sets being collaboratively downloaded and managed.
• The Portal-based P2P system has four main components:
– Virtual Registry,
– Bootstrapping service,
– Portal P2P service,
– Mini peers.
• We attempted to simulate our design and idea with one of the P2P simulators.
• We have investigated and tested several P2P simulators for their suitability to emulate our design.
• We found that the simulators we studied were inflexible, unstable, and not easy to use; we would have spent more time fixing them than implementing and testing our design on a cluster.
Conclusions
• Distributing and managing large data sets is difficult for projects such as SDSS.
• P2P simulators are not as useful as first thought.
• We will implement our Portal-based P2P system and test it on a suitable test bed, i.e. a cluster.
• Once the development of our P2P system has reached a suitable stage, we may consider systems such as PlanetLab:
– PlanetLab provides time on a real network with hundreds of nodes, hosted by academic institutions.
• P2P systems are known to be an efficient way to distribute files and are becoming increasingly popular.
• Implementation should be at a suitable stage for preliminary testing in a few months.
References
• Tycho - http://acet.rdg.ac.uk/projects/tycho
• Further Information - http://acet.rdg.ac.uk/projects/vre/docs.php