1
Scoped and Approximate Queries in a Relational Grid Information Service
Dong Lu, Peter A. Dinda, Jason A. Skicewicz
Prescience Lab, Dept. of Computer Science
Northwestern University, Evanston, IL 60201
2
Outline
• Introduction and motivation
– Powerful queries, but expensive to execute
– Trade-off between result size and query time
• Our solutions: scoped query, approximate query, scoped approximate query
– Nondeterministic query (SC talk on Tuesday)
• Performance evaluation
3
What is RGIS?
• GIS: A Grid Information Service stores information about the resources and services in a distributed computing environment and answers queries about them.
• RGIS: A Grid Information Service based on the relational data model.
4
Why RGIS?
1. RGIS can answer complex compositional queries
• Relational algebra (SQL)
• Joins
• Difficult in a hierarchical model (directory service)
2. Other reasons
• Indexes separate from data model
• Schema evolution
• Transactional insert/update/delete
• Consistency
5
RGIS Model of a Grid
[Figure: object type diagram. Object labels: module, endpoint, maclink, macswitch, iplink, router, host, connectorswitch, connectorlink. Layer labels: Network, Data link, Physical, Software.]
• Annotated network topology graph
• Annotation examples
– Hosts: memory, disk, OS, NICs, etc.
– Router/Switch: backplane bandwidth, ports
– Link: latency and bandwidth
• Highly dynamic data in streams, not DB
• Virtualization, Futures, Leases
– Virtual machines
6
The RGIS Design (Per Site)
[Figure: per-site architecture diagram. Component labels follow.]
• RDBMS: Oracle 9i back end (Windows, Linux, Parallel Server, etc.). Use of Oracle is not a requirement of the approach.
• Oracle 9i front end: transactional inserts and updates using stored procedures, queries using select statements (uses the database’s access control)
• Schema, type hierarchy, indices, PL/SQL stored procedures for each object
• Update Manager
• Query Manager and Rewriter
• Interfaces: Web Interface, SOAP Interface, Authenticated Direct Interface, Content Delivery Network Interface (for loose consistency)
• Users, Applications
• Site-to-site (tentative)
• Updates are encrypted using asymmetric cryptography on the network. Only those with appropriate keys have access.
7
Challenge/Trade off
• Complex queries to a relational database can take a long time.
– Hours, days, or even weeks when we want seconds
• Typically, the returned result set is unnecessarily big.
– The query returns all results
• We need mechanisms to trade off query time against the size of the result set.
8
Challenge/Trade off
[Figure: diagram relating four result sets: All results, Scoped results, Nondeterministic results, Approximate results]
9
Example: Cluster Finder
[Figure: a cluster of hosts connected by IP links to routers]
Find N hosts connected to the same router, with total memory of at least N*512 MB, all running Linux, such that the bisection bandwidth of the cluster is no less than 100 Mbits/sec.
10
Original SQL for 2 Host Cluster Finder
SELECT [scoped-approx] h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb + h2.mem_mb >= 1024
  AND h1.os = 'linux' AND h2.os = 'linux'
  AND ((l1.src = r.distip AND l2.src = r.distip AND l1.dest = h1.distip AND l2.dest = h2.distip)
    OR (l1.dest = r.distip AND l2.dest = r.distip AND l1.src = h1.distip AND l2.src = h2.distip))
  AND h1.distip <> h2.distip
  AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
[SCOPED BY r.distip = X]
WITHIN 100 seconds;
11
Original SQL for Cluster Finder
• Finding an N-node cluster requires a (2*N+1)-way join: N host tables, N link tables, and one router table. This does not scale.
[Figure: two candidate clusters (Cluster 1, Cluster 2), each made of hosts connected by IP links to routers]
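To make the 2*N+1 count concrete, here is a minimal sketch that lists the table aliases the unscoped cluster finder would have to join for a given N. The function name `cluster_finder_tables` is hypothetical and exists only for illustration; the alias pattern (h1..hN, l1..lN, r) follows the 2-host query shown earlier.

```python
def cluster_finder_tables(n):
    """Return the table aliases joined by the unscoped N-host cluster finder."""
    hosts = [f"hosts h{i}" for i in range(1, n + 1)]     # one alias per host
    links = [f"iplinks l{i}" for i in range(1, n + 1)]   # one link per host
    return hosts + links + ["routers r"]                 # plus the shared router


for n in (2, 4, 8):
    tables = cluster_finder_tables(n)
    print(n, len(tables), "FROM " + ", ".join(tables))
```

The number of joined tables grows linearly with N, but the number of candidate row combinations the database may have to consider grows much faster, which is consistent with the query times reported later in the talk.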
12
Scoped Cluster Finder
[Figure: hosts connected by IP links to routers, with one router selected]
Query only the hosts around one randomly chosen router.
13
Scoped Cluster Finder
SELECT H1.DISTIP, H2.DISTIP
FROM HOSTS H1, HOSTS H2, IPLINKS L1, IPLINKS L2, ROUTERS R
WHERE H1.MEM_MB + H2.MEM_MB >= 1024
  AND H1.OS = 'LINUX' AND H2.OS = 'LINUX'
  AND ((L1.SRC = R.DISTIP AND L2.SRC = R.DISTIP AND L1.DEST = H1.DISTIP AND L2.DEST = H2.DISTIP)
    OR (L1.DEST = R.DISTIP AND L2.DEST = R.DISTIP AND L1.SRC = H1.DISTIP AND L2.SRC = H2.DISTIP))
  AND H1.DISTIP <> H2.DISTIP
  AND L1.BW_MBS >= 100 AND L2.BW_MBS >= 100
  AND R.DISTIP = X;
14
Approximate Cluster Finder
• When searching for N hosts with total memory of at least N*512 MB, we can approximate the query with “search for N hosts, each with at least 512 MB of memory”.
• This reduces the number of joins, or avoids them entirely.
• However, this won’t find, say, N/2 hosts with 256 MB and N/2 hosts with 768 MB.
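The difference between the two predicates can be checked with a few lines of arithmetic. This is an illustrative sketch only; the function names `exact_match` and `approx_match` are hypothetical, and the memory values are made up to mirror the 256 MB / 768 MB example above.

```python
def exact_match(mem_mb, per_host=512):
    """Original predicate: total memory of the N hosts is at least N * per_host."""
    return sum(mem_mb) >= len(mem_mb) * per_host


def approx_match(mem_mb, per_host=512):
    """Approximate predicate: every host individually has at least per_host MB."""
    return all(m >= per_host for m in mem_mb)


uniform = [512, 512, 512, 512]
mixed = [256, 256, 768, 768]   # N/2 hosts with 256 MB, N/2 hosts with 768 MB

print(exact_match(uniform), approx_match(uniform))
print(exact_match(mixed), approx_match(mixed))
```

Any set of hosts that satisfies the approximate predicate also satisfies the original one, so the approximation never returns a wrong cluster; it only misses some valid ones.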
15
Approximate Cluster Finder
SELECT R.DISTIP, H1.DISTIP
FROM HOSTS H1, IPLINKS L1, ROUTERS R
WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
  AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
    OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
  AND R.DISTIP IN
    (SELECT R.DISTIP
     FROM HOSTS H1, IPLINKS L1, ROUTERS R
     WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
       AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
         OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
     GROUP BY R.DISTIP
     HAVING COUNT(*) >= 2)
ORDER BY R.DISTIP;
16
Scoped Approximate Cluster Finder
• Combine the approximate query with the scoped query.
• Scope the query to one randomly chosen router at a time. If no results are found, choose another random router and repeat the query.
• Approximate the N-host join for 512*N MB of total memory with a search for N hosts, each with >= 512 MB.
• Always a three-way join
– Regardless of the size of the cluster being searched for, and thus very scalable
– May need to search multiple routers
17
Scoped Approximate Cluster Finder
SELECT H1.DISTIP
FROM HOSTS H1, IPLINKS L1, ROUTERS R
WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
  AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
    OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
  AND R.DISTIP = X
  AND ROWNUM <= 2
The scoped approximate cluster finder has a fixed number of joins.
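The retry-over-random-routers procedure can be sketched end to end against a toy database. This is a minimal illustration, not the RGIS implementation: it uses Python's built-in sqlite3 instead of Oracle, so `LIMIT` stands in for `ROWNUM`; the function `find_cluster`, the toy rows, and the lowercase `'linux'` literal are all assumptions. It also assumes that a router yielding fewer than N qualifying hosts counts as "no results".

```python
import random
import sqlite3

# Three-way join scoped to one router (bound as the first parameter).
QUERY = """
SELECT h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip = ?
LIMIT ?
"""


def find_cluster(conn, n, rng=random):
    """Try routers in random order; return (router, hosts) or None."""
    routers = [row[0] for row in conn.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)
    for router in routers:
        hosts = [row[0] for row in conn.execute(QUERY, (router, n))]
        if len(hosts) == n:          # enough qualifying hosts at this router
            return router, hosts
    return None                      # every router was tried without success


# Toy data (made up): only router r2 has two qualifying hosts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE routers (distip TEXT);
CREATE TABLE hosts (distip TEXT, mem_mb INTEGER, os TEXT);
CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
INSERT INTO routers VALUES ('r1'), ('r2');
INSERT INTO hosts VALUES ('a', 1024, 'linux'), ('b', 256, 'linux'),
                         ('c', 512, 'linux'), ('d', 768, 'linux');
INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                           ('r2', 'c', 1000), ('d', 'r2', 100);
""")
print(find_cluster(conn, 2))
```

The query text is the same for every cluster size; only the row limit and the router parameter change, which is why the number of joins stays fixed.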
18
Time bounded queries
• The query rewriter starts the query as a child process.
• The parent kills the child process if no results are returned within the deadline.
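The parent/child deadline pattern can be sketched with Python's standard `subprocess` module. This is only an illustration of the mechanism, not the RGIS query rewriter: the function name `run_with_deadline` is hypothetical, and the child here is a small Python snippet standing in for a database query.

```python
import subprocess
import sys


def run_with_deadline(child_code, deadline_s):
    """Run child_code in a child Python process.

    Returns the child's stdout, or None if the deadline expires first.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True, text=True, timeout=deadline_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None
    return done.stdout


fast = "print('h1 h2')"               # stands in for a query that finishes
slow = "import time; time.sleep(30)"  # stands in for a query that runs too long

print(run_with_deadline(fast, 10))
print(run_with_deadline(slow, 1))
```

Running the query in a separate process means the parent can enforce the deadline without cooperation from the query itself.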
19
Limitations of Scoped and Approximate queries
• The returned results are a subset of the original query's results. A scoped or approximate query may report no results even though the original query would return some after running for a long time.
• Not all queries can be written as scoped or approximate queries.
• It is hard to automate scoped and approximate query rewriting.
20
Performance Evaluation
• We need to populate the database with a large amount of data.
• Computational grids are still in their early stages.
– No large data sets available
– Use Smith MDS data for memory
• We generate synthetic grids that are representative of the Internet.
– Can generate very large grids
21
GridG Generated Synthetic Grids
• Three-level network: WAN, MAN, LAN. Nodes on the WAN and MAN are routers, while nodes on the LAN are hosts.
• Links: IP links annotated with bandwidth and latency.
• Hosts: annotated with memory size, architecture, number of processors, CPU clock rate, disk size, etc.
• The user can control all the distributions and the size of the network.
22
GridG: Synthesizing Realistic Computational Grids
http://www.cs.northwestern.edu/~urgis/GridG
[Figure: GridG tool chain diagram. Labels: Base Topology Generator (Tiers); Structured Topology; Translation To Common Format; GridG Power Law Enforcer; Structured Topology that obeys power laws; GridG Annotator; Grid; Other transformations on common format (Cluster maker, etc.); GIS Simulator; DOT Visualization; Other Tools; RGIS Database]
SC talk on Tuesday!
23
Experimental Setup
• Dell PowerEdge 4400: dual Xeon 1 GHz processors, 2 GB memory, 240 GB RAID 5 storage system.
• Oracle 9i Enterprise Edition, Red Hat Linux 7.1.
• Each test is repeated either 25 or 100 times, and we provide the average value.
24
Performance of Various Query Techniques with Cluster Finder
Cluster size | Standard | Scoped | Approx | Scoped Approx
2 | 21.44 | 2.27 | 7.62 | 1.16
4 | >7200 | 2047.9 | 7.48 | 1.32
8 | >9000 | >3600 | 7.46 | 1.43
16 | N/A | >3600 | 7.51 | 1.45
32 | N/A | >3600 | 7.65 | 5.96
64 | N/A | >3600 | >120 | 9.58
(Time to run query in Seconds)
25
Performance of Scoped Approximate Queries
• Cluster Finder: Find N hosts, each running Linux, with total memory of at least N*512 MB, all connected to the same router, with a bisection bandwidth of at least 100 Mbits/sec.
– Our running example
• Non-network query: Find N hosts with total memory of at least N*512 MB.
– No joins needed at all
26
Performance of Scoped Approximate Queries (2)
• Scalability with database size.
• Scalability with the complexity of queries.
• Scalability with concurrent users and update load.
27
Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder)
28
Performance of Scoped Approximate Query (101K hosts, Cluster Finder)
29
Performance of Scoped Approximate Query (980K hosts, Cluster Finder)
30
Performance of Scoped Approximate Query (9.8K hosts, Non-network query)
31
Performance of Scoped Approximate Query (101K hosts, Non-network query)
32
Performance of Scoped Approximate Query (980K hosts, Non-network query)
33
Scalability with multiple concurrent users and background load
• Other research has shown that GIS servers must handle frequent updates while serving requests.
• GIS servers serve multiple concurrent users.
• We evaluate scoped approximate queries with concurrent users and update load.
• Concurrent users: execute queries repeatedly
• Update load: execute transactional updates on randomly selected hosts as fast as possible
– About 200 updates/second
34
Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder, with Concurrent Users, looking for 64 nodes)
35
Performance of Scoped Approximate Query (9.8K hosts, Non-network query, with Concurrent Users, looking for 64 nodes)
36
Conclusions
• We described and evaluated two query techniques that trade off query time against the size of the result set: scoped queries and approximate queries.
• Combining scoped and approximate queries can dramatically reduce response time and server load.
37
For more information
• GridG and Related paper: http://www.cs.northwestern.edu/~urgis/GridG
“Synthesizing Realistic Computational Grids”, In proceedings of SC03.
• RGIS and Related paper: http://www.cs.northwestern.edu/~urgis/
“Nondeterministic Queries in a Relational Grid Information Service”, In proceedings of SC03.