PIER & PHI
Overview of Challenges & Opportunities
Ryan Huebsch†
Joe Hellerstein†°, Boon Thau Loo†, Sam Mardanbeigi†, Scott Shenker†‡, Ion Stoica†
[email protected]
†UC Berkeley, CS Division
‡International Computer Science Institute, Berkeley CA
°Intel Research Berkeley
STREAM DAY 5/7/04
PIER
P2P Information Exchange & Retrieval
  A wide-area distributed dataflow engine
  Outfitted with relational operators
  Designed to scale to thousands or millions of nodes
Motivation:
  It’s an interesting challenge
  Lowers the barrier of entry for large-scale applications
    No massive infrastructure for server farms
    Cost is distributed among participants
  Provide a viable solution where other options are not socially acceptable
We are NOT trying to be better than other (centralized) solutions; we are trying to be different.
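To make "a dataflow engine outfitted with relational operators" concrete, here is a minimal single-node sketch in Python. It is not PIER's code or API: the operator names (`scan`, `select`, `project`), the `Row` type, and the sample column names are all assumptions chosen for illustration, and the wide-area distribution that the slide emphasizes is deliberately left out.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical row type: a tuple represented as a column-name -> value mapping.
Row = Dict[str, object]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: emit stored rows one at a time."""
    for row in rows:
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection operator: pass through only rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Projection operator: keep only the named columns."""
    for row in rows:
        yield {c: row[c] for c in columns}


def run_local_plan(rows: Iterable[Row]) -> List[Row]:
    """Chain the operators into a dataflow: scan -> select -> project."""
    plan = project(select(scan(rows), lambda r: r["size"] > 100), ["name"])
    return list(plan)
```

Each operator pulls tuples from the one below it, so tuples flow through the plan one at a time instead of being materialized between steps. In a wide-area setting the edges between operators would cross the network, which this sketch does not attempt to model.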
Challenges
[Diagram: system layers, from Physical Network through Overlay Network and Query Plan up to Declarative Queries, annotated with challenges]
Challenges labeled in the diagram:
  Query Optimization, Multi-Query Optimization
  Catalogs, Persistent Storage
  Recursion
  Query Dissemination, Replication, Soft-State
  Quality of Service
  Resilience, Route Flapping
  Efficiency
  Security, Privacy, Quality of Service
  General Challenges
Applications & Requirements
File sharing
  Flooding works for popular items
  Need something better for rare items
  May want ‘triggers’ when a new item matches an old search
Network Monitoring
  Aggregation & grouping very common
  Continuous queries with well-defined semantics
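The ‘triggers’ idea under File sharing (notify a user when a newly shared item matches a search issued earlier) can be sketched as a small standing-search index. This is an illustration under stated assumptions, not a description of how PIER implements it: the class `StandingSearches`, its methods, and the whole-word keyword matching rule are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class StandingSearches:
    """Remembers past keyword searches and reports which ones a new item matches."""

    def __init__(self) -> None:
        # keyword (lowercased) -> ids of the users who searched for it
        self._by_keyword: Dict[str, Set[str]] = defaultdict(set)

    def register(self, searcher: str, keyword: str) -> None:
        """Record an old search so that later items can be checked against it."""
        self._by_keyword[keyword.lower()].add(searcher)

    def publish(self, item_name: str) -> List[str]:
        """Return, in sorted order, the searchers whose keyword is a word of the new item's name."""
        matched: Set[str] = set()
        for word in set(item_name.lower().split()):
            matched |= self._by_keyword.get(word, set())
        return sorted(matched)
```

The design choice is to index searches by keyword so that publishing an item costs one lookup per word in its name, not one comparison per stored search. In a peer-to-peer deployment the index itself would have to be partitioned across nodes, which is outside this sketch.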
PHI is one use of PIER…
PHI
Public Health for the Internet
Community-based monitoring
The metaphor:
  Old way: Treat computers with medicine
    Virus protection
  New way: Monitor the community
    Like the Center for Disease Control
Global CDC has social implications
  Central repository, privacy, who controls it, who pays for it…
PHI wants to create the Center for Disease Control without the Center (of control)
Motivation is to inform users about the dangers of the Internet
PHI Example
PIER is currently deployed on 150-300 PlanetLab nodes.
  ~100 sites
  Some nodes on DSL, 1 Mbps, 10 Mbps, etc.
  Very unreliable
SNORT is the primary data source
  ~2400 rules
  10’s - 1000’s of tuples per day per node
  Schema: time, rule, source socket, destination socket
Quick Demo:
  Shows the top ten sources of events across all of PlanetLab (live), i.e. who are the bad guys?
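The demo query (top ten event sources across all nodes) is a grouped count followed by a top-k. One plausible way to evaluate it, sketched below, is for each node to count its own events per source and for the partial counts to be merged before ranking. The slides do not say how PIER actually executes the query, so treat this as an assumption. The `Event` type follows the schema above, while the function names and the sample socket strings are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """One intrusion-detection tuple, following the schema on the slide."""
    time: float
    rule: int
    source: str       # source socket, written here as "address:port"
    destination: str  # destination socket, same format


def local_counts(events: Iterable[Event]) -> Counter:
    """Per-node step: count events grouped by source socket."""
    return Counter(e.source for e in events)


def top_sources(per_node_counts: Iterable[Counter], k: int = 10) -> List[Tuple[str, int]]:
    """Merge step: sum the partial counts, then keep the k most frequent sources."""
    total: Counter = Counter()
    for counts in per_node_counts:
        total.update(counts)
    # Sort by count descending, then by source so that ties are deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Shipping per-source counts instead of raw tuples keeps traffic proportional to the number of distinct sources a node has seen, which matters when some nodes sit behind DSL links.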
What’s next…
PIER
  Lots of problems, including the meta-problem of what problem to work on
  No streaming semantics, no language to describe windows, etc…
  Additional challenges: Interaction with soft-state, no synchronized clocks, unknown (changing) network latencies
PHI
  Create a complete application
    Gets intrusion data from a variety of sources (including the built-in Windows Firewall)
    Develop a snazzy visualization
  Release to the world, first using PlanetLab as the query processor, eventually the world
  Scale to at least 10,000’s of nodes and explore the design space